diff mbox series

[v3] CODING_STYLE: Document how to handle unexpected conditions

Message ID 20191209112954.124169-1-george.dunlap@citrix.com (mailing list archive)
State Superseded
Headers show
Series [v3] CODING_STYLE: Document how to handle unexpected conditions | expand

Commit Message

George Dunlap Dec. 9, 2019, 11:29 a.m. UTC
It's not always clear what the best way is to handle unexpected
conditions: whether with ASSERT(), domain_crash(), BUG_ON(), or some
other method.  All methods have a risk of introducing security
vulnerabilities and unnecessary instabilities to production systems.

Provide guidelines for different options and when to use them.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
v3:
- A number of minor edits
- Expand on domain_crash a bit.
v2:
- Clarify meaning of "or" clause
- Add domain_crash as an option
- Make it clear that ASSERT() is not an error handling mechanism.

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien.grall@arm.com>
---
 CODING_STYLE | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

Comments

Jan Beulich Dec. 9, 2019, 1:50 p.m. UTC | #1
On 09.12.2019 12:29, George Dunlap wrote:
> --- a/CODING_STYLE
> +++ b/CODING_STYLE
> @@ -133,3 +133,97 @@ the end of files.  It should be:
>   * indent-tabs-mode: nil
>   * End:
>   */
> +
> +Handling unexpected conditions
> +------------------------------
> +
> +GUIDELINES:
> +
> +Passing errors up the stack should be used when the caller is already
> +expecting to handle errors, and the state when the error was
> +discovered isn’t broken, or isn't too hard to fix.
> +
> +domain_crash() should be used when passing errors up the stack is too
> +difficult, and/or when fixing up state of a guest is impractical, but
> +where fixing up the state of Xen will allow Xen to continue running.
> +This is particularly appropriate when the guest is exhibiting behavior
> +well-behaved guest should.

DYM "shouldn't"?

> +BUG_ON() should be used when you can’t pass errors up the stack, and
> +either continuing or crashing the guest would likely cause an
> +information leak or privilege escalation vulnerability.
> +
> +ASSERT() IS NOT AN ERROR HANDLING MECHANISM.  ASSERT is a way to move
> +detection of a bug earlier in the programming cycle; it is a
> +more-noticeable printk.  It should only be added after one of the
> +other three error-handling mechanisms has been evaluated for
> +reliability and security.
> +
> +RATIONALE:
> +
> +It's frequently the case that code is written with the assumption that
> +certain conditions can never happen.  There are several possible
> +actions programmers can take in these situations:
> +
> +* Programmers can simply not handle those cases in any way, other than
> +perhaps to write a comment documenting what the assumption is.
> +
> +* Programmers can try to handle the case gracefully -- fixing up
> +in-progress state and returning an error to the user.
> +
> +* Programmers can crash the guest.
> +
> +* Programmers can use ASSERT(), which will cause the check to be
> +executed in DEBUG builds, and cause the hypervisor to crash if it's
> +violated
> +
> +* Programmers can use BUG_ON(), which will cause the check to be
> +executed in both DEBUG and non-DEBUG builds, and cause the hypervisor
> +to crash if it's violated.
> +
> +In selecting which response to use, we want to achieve several goals:
> +
> +- To minimize risk of introducing security vulnerabilities,
> +  particularly as the code evolves over time
> +
> +- To efficiently spend programmer time
> +
> +- To detect violations of assumptions as early as possible
> +
> +- To minimize the impact of bugs on production use cases
> +
> +The guidelines above attempt to balance these:
> +
> +- When the caller is expecting to handle errors, and there is no
> +broken state at the time the unexpected condition is discovered, or
> +when fixing the state is straightforward, then fixing up the state and
> +returning an error is the most robust thing to do.  However, if the
> +caller isn't expecting to handle errors, or if the state is difficult
> +to fix, then returning an error may require extensive refactoring,
> +which is not a good use of programmer time when they're certain that
> +this condition cannot occur.
> +
> +- BUG_ON() will stop all hypervisor action immediately.  In situations
> +where continuing might allow an attacker to escalate privilege, a
> +BUG_ON() can change a privilege escalation or information leak into a
> +denial-of-service (an improvement).  But in situations where
> +continuing (say, returning an error) might be safe, then BUG_ON() can
> +change a benign failure into denial-of-service (a degradation).
> +
> +- domain_crash() is similar to BUG_ON(), but with a more limited
> +effect: it stops that domain immediately.  In situations where
> +continuing might cause guest or hypervisor corruption, but destroying
> +the guest allows the hypervisor to continue, this can change a more
> +serious bug into a guest denial-of-service.  But in situations where
> +returning an error might be safe, then domain_crash() can change a
> +benign failure into a guest denial-of-service.

Perhaps further put emphasis on the call tree still getting unwound
normally, which may imply further actions on the (now dying) domain
taken. Unfortunately it's not unusual for people to forget this; I
think the IOMMU code in particular was (hopefully isn't so much
anymore) a "good" example of this.

Jan
George Dunlap Dec. 10, 2019, 10:56 a.m. UTC | #2
On 12/9/19 1:50 PM, Jan Beulich wrote:
> On 09.12.2019 12:29, George Dunlap wrote:
>> --- a/CODING_STYLE
>> +++ b/CODING_STYLE
>> @@ -133,3 +133,97 @@ the end of files.  It should be:
>>   * indent-tabs-mode: nil
>>   * End:
>>   */
>> +
>> +Handling unexpected conditions
>> +------------------------------
>> +
>> +GUIDELINES:
>> +
>> +Passing errors up the stack should be used when the caller is already
>> +expecting to handle errors, and the state when the error was
>> +discovered isn’t broken, or isn't too hard to fix.
>> +
>> +domain_crash() should be used when passing errors up the stack is too
>> +difficult, and/or when fixing up state of a guest is impractical, but
>> +where fixing up the state of Xen will allow Xen to continue running.
>> +This is particularly appropriate when the guest is exhibiting behavior
>> +well-behaved guest should.
> 
> DYM "shouldn't"?

Indeed.
>> +- domain_crash() is similar to BUG_ON(), but with a more limited
>> +effect: it stops that domain immediately.  In situations where
>> +continuing might cause guest or hypervisor corruption, but destroying
>> +the guest allows the hypervisor to continue, this can change a more
>> +serious bug into a guest denial-of-service.  But in situations where
>> +returning an error might be safe, then domain_crash() can change a
>> +benign failure into a guest denial-of-service.
> 
> Perhaps further put emphasis on the call tree still getting unwound
> normally, which may imply further actions on the (now dying) domain
> taken. Unfortunately it's not unusual for people to forget this; I
> think the IOMMU code in particular was (hopefully isn't so much
> anymore) a "good" example of this.

Can you expand on this?  Do you mean to advise that care should be taken
when returning up the callstack that the domain which was running before
may now be dying, and to behave appropriately?

 -George
Jan Beulich Dec. 10, 2019, 1:46 p.m. UTC | #3
On 10.12.2019 11:56, George Dunlap wrote:
> On 12/9/19 1:50 PM, Jan Beulich wrote:
>> On 09.12.2019 12:29, George Dunlap wrote:
>>> --- a/CODING_STYLE
>>> +++ b/CODING_STYLE
>>> @@ -133,3 +133,97 @@ the end of files.  It should be:
>>>   * indent-tabs-mode: nil
>>>   * End:
>>>   */
>>> +
>>> +Handling unexpected conditions
>>> +------------------------------
>>> +
>>> +GUIDELINES:
>>> +
>>> +Passing errors up the stack should be used when the caller is already
>>> +expecting to handle errors, and the state when the error was
>>> +discovered isn’t broken, or isn't too hard to fix.
>>> +
>>> +domain_crash() should be used when passing errors up the stack is too
>>> +difficult, and/or when fixing up state of a guest is impractical, but
>>> +where fixing up the state of Xen will allow Xen to continue running.
>>> +This is particularly appropriate when the guest is exhibiting behavior
>>> +well-behaved guest should.
>>
>> DYM "shouldn't"?
> 
> Indeed.

(Btw, noticing only now - there's also either an "a" missing, or it
wants to be "guests".)

>>> +- domain_crash() is similar to BUG_ON(), but with a more limited
>>> +effect: it stops that domain immediately.  In situations where
>>> +continuing might cause guest or hypervisor corruption, but destroying
>>> +the guest allows the hypervisor to continue, this can change a more
>>> +serious bug into a guest denial-of-service.  But in situations where
>>> +returning an error might be safe, then domain_crash() can change a
>>> +benign failure into a guest denial-of-service.
>>
>> Perhaps further put emphasis on the call tree still getting unwound
>> normally, which may imply further actions on the (now dying) domain
>> taken. Unfortunately it's not unusual for people to forget this; I
>> think the IOMMU code in particular was (hopefully isn't so much
>> anymore) a "good" example of this.
> 
> Can you expand on this?  Do you mean to advise that care should be taken
> when returning up the callstack that the domain which was running before
> may now be dying, and to behave appropriately?

One issue is with functions returning void, where the caller won't
even know that something may have gone wrong. Another is that
typically error paths are less commonly used, and crashing a
domain would typically be accompanied by indicating an error to
the upper layers. Hence such crashing may trigger unrelated bugs.
A third aspect is that, indeed, dying guests may need special
treatment (see the already existing ->is_dying checks we have).

I mentioned the call tree unwinding in particular because earlier
on we had domain_crash_synchronous(), which was there specifically
to avoid issues with errors (and the changed state) bubbling back
up. But this model had other issues, hence our movement away from
it.

Jan
George Dunlap Jan. 6, 2020, 4:19 p.m. UTC | #4
On 12/10/19 1:46 PM, Jan Beulich wrote:
> On 10.12.2019 11:56, George Dunlap wrote:
>> On 12/9/19 1:50 PM, Jan Beulich wrote:
>>> On 09.12.2019 12:29, George Dunlap wrote:
>>>> --- a/CODING_STYLE
>>>> +++ b/CODING_STYLE
>>>> @@ -133,3 +133,97 @@ the end of files.  It should be:
>>>>   * indent-tabs-mode: nil
>>>>   * End:
>>>>   */
>>>> +
>>>> +Handling unexpected conditions
>>>> +------------------------------
>>>> +
>>>> +GUIDELINES:
>>>> +
>>>> +Passing errors up the stack should be used when the caller is already
>>>> +expecting to handle errors, and the state when the error was
>>>> +discovered isn’t broken, or isn't too hard to fix.
>>>> +
>>>> +domain_crash() should be used when passing errors up the stack is too
>>>> +difficult, and/or when fixing up state of a guest is impractical, but
>>>> +where fixing up the state of Xen will allow Xen to continue running.
>>>> +This is particularly appropriate when the guest is exhibiting behavior
>>>> +well-behaved guest should.
>>>
>>> DYM "shouldn't"?
>>
>> Indeed.
> 
> (Btw, noticing only now - there's also either an "a" missing, or it
> wants to be "guests".)
> 
>>>> +- domain_crash() is similar to BUG_ON(), but with a more limited
>>>> +effect: it stops that domain immediately.  In situations where
>>>> +continuing might cause guest or hypervisor corruption, but destroying
>>>> +the guest allows the hypervisor to continue, this can change a more
>>>> +serious bug into a guest denial-of-service.  But in situations where
>>>> +returning an error might be safe, then domain_crash() can change a
>>>> +benign failure into a guest denial-of-service.
>>>
>>> Perhaps further put emphasis on the call tree still getting unwound
>>> normally, which may imply further actions on the (now dying) domain
>>> taken. Unfortunately it's not unusual for people to forget this; I
>>> think the IOMMU code in particular was (hopefully isn't so much
>>> anymore) a "good" example of this.
>>
>> Can you expand on this?  Do you mean to advise that care should be taken
>> when returning up the callstack that the domain which was running before
>> may now be dying, and to behave appropriately?
> 
> One issue is with functions returning void, where the caller won't
> even know that something may have gone wrong. Another is that
> typically error paths are less commonly used, and crashing a
> domain would typically be accompanied by indicating an error to
> the upper layers. Hence such crashing may trigger unrelated bugs.
> A third aspect is that, indeed, dying guests may need special
> treatment (see the already existing ->is_dying checks we have).
> 
> I mentioned the call tree unwinding in particular because earlier
> on we had domain_crash_synchronous(), which was there specifically
> to avoid issues with errors (and the changed state) bubbling back
> up. But this model had other issues, hence our movement away from
> it.

How about a paragraph like this:

---
Note however that domain_crash() has its own traps: callers far up the
call stack may not realize that the domain is now dying as a result of
an innocuous-looking operation, particularly if somewhere between the
failure and the operation, no error is returned.  Using it requires
careful inspection and documentation of the code to make sure all
callers at the stack handle a newly-dead domain gracefully.
---

 -George
Jan Beulich Jan. 6, 2020, 4:29 p.m. UTC | #5
On 06.01.2020 17:19, George Dunlap wrote:
> On 12/10/19 1:46 PM, Jan Beulich wrote:
>> On 10.12.2019 11:56, George Dunlap wrote:
>>> On 12/9/19 1:50 PM, Jan Beulich wrote:
>>>> On 09.12.2019 12:29, George Dunlap wrote:
>>>>> --- a/CODING_STYLE
>>>>> +++ b/CODING_STYLE
>>>>> @@ -133,3 +133,97 @@ the end of files.  It should be:
>>>>>   * indent-tabs-mode: nil
>>>>>   * End:
>>>>>   */
>>>>> +
>>>>> +Handling unexpected conditions
>>>>> +------------------------------
>>>>> +
>>>>> +GUIDELINES:
>>>>> +
>>>>> +Passing errors up the stack should be used when the caller is already
>>>>> +expecting to handle errors, and the state when the error was
>>>>> +discovered isn’t broken, or isn't too hard to fix.
>>>>> +
>>>>> +domain_crash() should be used when passing errors up the stack is too
>>>>> +difficult, and/or when fixing up state of a guest is impractical, but
>>>>> +where fixing up the state of Xen will allow Xen to continue running.
>>>>> +This is particularly appropriate when the guest is exhibiting behavior
>>>>> +well-behaved guest should.
>>>>
>>>> DYM "shouldn't"?
>>>
>>> Indeed.
>>
>> (Btw, noticing only now - there's also either an "a" missing, or it
>> wants to be "guests".)
>>
>>>>> +- domain_crash() is similar to BUG_ON(), but with a more limited
>>>>> +effect: it stops that domain immediately.  In situations where
>>>>> +continuing might cause guest or hypervisor corruption, but destroying
>>>>> +the guest allows the hypervisor to continue, this can change a more
>>>>> +serious bug into a guest denial-of-service.  But in situations where
>>>>> +returning an error might be safe, then domain_crash() can change a
>>>>> +benign failure into a guest denial-of-service.
>>>>
>>>> Perhaps further put emphasis on the call tree still getting unwound
>>>> normally, which may imply further actions on the (now dying) domain
>>>> taken. Unfortunately it's not unusual for people to forget this; I
>>>> think the IOMMU code in particular was (hopefully isn't so much
>>>> anymore) a "good" example of this.
>>>
>>> Can you expand on this?  Do you mean to advise that care should be taken
>>> when returning up the callstack that the domain which was running before
>>> may now be dying, and to behave appropriately?
>>
>> One issue is with functions returning void, where the caller won't
>> even know that something may have gone wrong. Another is that
>> typically error paths are less commonly used, and crashing a
>> domain would typically be accompanied by indicating an error to
>> the upper layers. Hence such crashing may trigger unrelated bugs.
>> A third aspect is that, indeed, dying guests may need special
>> treatment (see the already existing ->is_dying checks we have).
>>
>> I mentioned the call tree unwinding in particular because earlier
>> on we had domain_crash_synchronous(), which was there specifically
>> to avoid issues with errors (and the changed state) bubbling back
>> up. But this model had other issues, hence our movement away from
>> it.
> 
> How about a paragraph like this:
> 
> ---
> Note however that domain_crash() has its own traps: callers far up the
> call stack may not realize that the domain is now dying as a result of
> an innocuous-looking operation, particularly if somewhere between the
> failure and the operation, no error is returned.  Using it requires
> careful inspection and documentation of the code to make sure all
> callers at the stack handle a newly-dead domain gracefully.
> ---

Sounds good, thanks.

Jan
diff mbox series

Patch

diff --git a/CODING_STYLE b/CODING_STYLE
index 810b71c16d..5ff493224b 100644
--- a/CODING_STYLE
+++ b/CODING_STYLE
@@ -133,3 +133,97 @@  the end of files.  It should be:
  * indent-tabs-mode: nil
  * End:
  */
+
+Handling unexpected conditions
+------------------------------
+
+GUIDELINES:
+
+Passing errors up the stack should be used when the caller is already
+expecting to handle errors, and the state when the error was
+discovered isn’t broken, or isn't too hard to fix.
+
+domain_crash() should be used when passing errors up the stack is too
+difficult, and/or when fixing up state of a guest is impractical, but
+where fixing up the state of Xen will allow Xen to continue running.
+This is particularly appropriate when the guest is exhibiting behavior
+well-behaved guest should.
+
+BUG_ON() should be used when you can’t pass errors up the stack, and
+either continuing or crashing the guest would likely cause an
+information leak or privilege escalation vulnerability.
+
+ASSERT() IS NOT AN ERROR HANDLING MECHANISM.  ASSERT is a way to move
+detection of a bug earlier in the programming cycle; it is a
+more-noticeable printk.  It should only be added after one of the
+other three error-handling mechanisms has been evaluated for
+reliability and security.
+
+RATIONALE:
+
+It's frequently the case that code is written with the assumption that
+certain conditions can never happen.  There are several possible
+actions programmers can take in these situations:
+
+* Programmers can simply not handle those cases in any way, other than
+perhaps to write a comment documenting what the assumption is.
+
+* Programmers can try to handle the case gracefully -- fixing up
+in-progress state and returning an error to the user.
+
+* Programmers can crash the guest.
+
+* Programmers can use ASSERT(), which will cause the check to be
+executed in DEBUG builds, and cause the hypervisor to crash if it's
+violated
+
+* Programmers can use BUG_ON(), which will cause the check to be
+executed in both DEBUG and non-DEBUG builds, and cause the hypervisor
+to crash if it's violated.
+
+In selecting which response to use, we want to achieve several goals:
+
+- To minimize risk of introducing security vulnerabilities,
+  particularly as the code evolves over time
+
+- To efficiently spend programmer time
+
+- To detect violations of assumptions as early as possible
+
+- To minimize the impact of bugs on production use cases
+
+The guidelines above attempt to balance these:
+
+- When the caller is expecting to handle errors, and there is no
+broken state at the time the unexpected condition is discovered, or
+when fixing the state is straightforward, then fixing up the state and
+returning an error is the most robust thing to do.  However, if the
+caller isn't expecting to handle errors, or if the state is difficult
+to fix, then returning an error may require extensive refactoring,
+which is not a good use of programmer time when they're certain that
+this condition cannot occur.
+
+- BUG_ON() will stop all hypervisor action immediately.  In situations
+where continuing might allow an attacker to escalate privilege, a
+BUG_ON() can change a privilege escalation or information leak into a
+denial-of-service (an improvement).  But in situations where
+continuing (say, returning an error) might be safe, then BUG_ON() can
+change a benign failure into denial-of-service (a degradation).
+
+- domain_crash() is similar to BUG_ON(), but with a more limited
+effect: it stops that domain immediately.  In situations where
+continuing might cause guest or hypervisor corruption, but destroying
+the guest allows the hypervisor to continue, this can change a more
+serious bug into a guest denial-of-service.  But in situations where
+returning an error might be safe, then domain_crash() can change a
+benign failure into a guest denial-of-service.
+
+- ASSERT() will stop the hypervisor during development, but allow
+hypervisor action to continue during production.  In situations where
+continuing will at worst result in a denial-of-service, and at best
+may have little effect other than perhaps quirky behavior, using an
+ASSERT() will allow violation of assumptions to be detected as soon as
+possible, while not causing undue degradation in production
+hypervisors.  However, in situations where continuing could cause
+privilege escalation or information leaks, using an ASSERT() can
+introduce security vulnerabilities.