Message ID | 129a431c1a12610fa7b44f76ce73aa8058f55bc6.1655368610.git.bristot@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | The Runtime Verification (RV) interface | expand |
On Thu, Jun 16, 2022 at 10:45:02AM +0200, Daniel Bristot de Oliveira wrote: > Adds documentation about the safe_wtd and safe_wtd_nwo RV monitors, > and their usage via a safety application. > > Cc: Wim Van Sebroeck <wim@linux-watchdog.org> > Cc: Guenter Roeck <linux@roeck-us.net> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Will Deacon <will@kernel.org> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Marco Elver <elver@google.com> > Cc: Dmitry Vyukov <dvyukov@google.com> > Cc: "Paul E. McKenney" <paulmck@kernel.org> > Cc: Shuah Khan <skhan@linuxfoundation.org> > Cc: Gabriele Paoloni <gpaoloni@redhat.com> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Clark Williams <williams@redhat.com> > Cc: linux-doc@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-trace-devel@vger.kernel.org > Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> > --- > Documentation/trace/rv/watchdog-monitor.rst | 250 ++++++++++++++++++++ > 1 file changed, 250 insertions(+) > create mode 100644 Documentation/trace/rv/watchdog-monitor.rst > > diff --git a/Documentation/trace/rv/watchdog-monitor.rst b/Documentation/trace/rv/watchdog-monitor.rst > new file mode 100644 > index 000000000000..2b142fb31572 > --- /dev/null > +++ b/Documentation/trace/rv/watchdog-monitor.rst > @@ -0,0 +1,250 @@ > +Watchdog monitor > +---------------- > + > +The watchdog is an essential building block for the usage of Linux in > +safety-critical systems because it allows the system to be monitored from > +an external element - the watchdog hardware, acting as a safety-monitor. > + > +A user-space application controls the watchdog device via the watchdog > +interface. This application, hereafter safety_app, enables the watchdog > +and periodically pets the watchdog upon correct completion of the safety > +related processing. > + > +If the safety_app, for any reason, stops pinging the watchdog, > +the watchdog hardware can set the system in a fail-safe state. For > +example, shutting the system down. > + > +Given the importance of the safety_app / watchdog hardware couple, > +the interaction between these software pieces also needs some > +sort of monitoring. In other words, "who monitors the monitor?" > + > +The safe watchdog (safe_wtd) RV monitor monitors the interaction between > +the safety_app and the watchdog device, enforcing the correct sequence of > +events that leads the system to a safe state. > + > +Furthermore, the safety_app can monitor the RV monitor by collecting the > +events generated by the RV monitor itself via tracing interface. In this way, > +closing the monitoring loop with the safety_app. > + > +A diagram of the components and their interactions is:: > + > + user-space: > + +--------------------------------+ > + | safety_app |-----------+ > + +--------------------------------+ | > + | ^ | > + | Configure | Enable and | > + | | check data | > + ===================+====================+=============== | > + kernel-space: | | | > + v v | > + +----------+ instr. +-------------+ | > + | watchdog | ----------->| RV Monitor |----+ | > + | device | +-------------+ | | > + +----------+ | | > + | | | > + | | | > + ================+====================================== | | > + hardware: | | | > + v | +-> Bring the system > + +--------------------+ +----> to a safe state, > + | watchdog hardware |---------------------------> e.g., halt. > + +--------------------+ > + > +Sample safety_app > +----------------- > + > +The user-space safety_app sample code in ``tools/verification/safety_app/`` > +serves to illustrate the usage of the RV monitors for this use-case, as > +well as the starting point to the development of a user-specific safety_app. > + > +Watchdog events > +--------------- > + > +The RV monitor observes the watchdog by using instrumentation to > +process the events generated by the interaction between the > +safety_app and the watchdog device layer in kernel. > + > +The monitored events are: > + > + - watchdog:watchdog_open: open the watchdog device; > + - watchdog:watchdog_close: close the watchdog device; > + - watchdog:watchdog_start: start the watchdog; > + - watchdog:watchdog_stop: stop the watchdog; > + - watchdog:watchdog_set_timeout: set the watchdog timeout; > + - watchdog:watchdog_ping: reprogram the watchdog with the previously set > + timeout; > + - watchdog:watchdog_nowayout: prevents the watchdog from stopping; > + - watchdog:watchdog_set_keep_alive: set an intermediary ping to overcome > + the limitation of a hardware watchdog maximum timeout being shorter than > + the timeout set by the user-space tool; > + - watchdog:watchdog_keep_alive: the execution of the function that runs the > + intermediary keep alive ping; > + > +RV monitor events > +----------------- > + > +The RV monitor monitors the relevant events as an outside observer, > +interpreting all the components (the hardware; the watchdog device > +interface; and the safety monitor) as an integrated component. > + > +The events selected for the monitor are: > + > + - other_threads: an event generated by any thread other than the > + one that set nowayout or open the watchdog the last time. > + - open: a thread opens the watchdog to manipulate it; > + - close: a thread closes the watchdog; > + - start: starts the watchdog countdown; > + - stop: stops the watchdog; > + - set_safe_timeout: configures the watchdog with a given timeout; > + - ping: resets the watchdog countdown with the previously configured timeout; > + - nowayout: prevents the watchdog to be stopped until the system's shutdown; > + - sched_keep_alive: schedules a kernel worker to ping the watchdog if the > + timeout is longer than the watchdog hardware can handle. > + - keep_alive: executes the previously scheduled watchdog ping; > + > +Noting that the events that does not appear in the automata models are > +considered blocked events, and their execution will always cause the > +RV monitor to react to an unexpected event. > + > +RV monitor specification > +------------------------ > + > +The monitor's goal is to assess a set of specifications that conducts the > +system to a safe state. > + > +These specifications are: > + > + - 1: Once open, only one process manipulates the watchdog; > + - 2: Following 1, the keep-alive mechanisms will not be used; > + - 3: If required, nowayout will be set before opening the watchdog; > + - 4: A safe timeout must be set; > + - 5: At least one ping must be made before entering the safe/safe_nwo states > + - 6: The RV monitor does not react if the watchdog is closed without stopping. > + But the hardware watchdog is expected to react. > + > +Deterministic automata monitors > +------------------------------- > + > +Following the specifications, a deterministic automata monitor > +was developed. The monitor is modeled as Deterministic Automata model. > + > +The deterministic automata model for safe_wtd is:: > + > + #==================================# other_threads > + H H ----------------+ > + -----------> H init H | > + H H <---------------+ > + #==================================# > + | | ^ > + | | | close > + | | +----------------------------------------------------+ > + | | | > + | | open | > + | +------------------------------------------------------+ | > + | | | > + | nowayout | | > + v | | > + nowayout +-------------------+ | | > + other_threads | | nowayout | | > + +---------------- | nwo |<-------------------------------------+ | | > + | | | | | | > + +---------------> | | <+ | | | > + +-------------------+ | | | | > + | | | | | > + | open | close | | | > + v | | | | > + +-------------------+ | | | | > + | opened_nwo | -+ | | | > + +-------------------+ | | | > + | | | | > + | start | | | > + v | | | > + +-------------------+ | | | > + +---------------> | started_nwo | -+ | | | > + | +-------------------+ | | | | > + | | | | | | > + | open | set_safe_timeout | | | | > + | v | | | | > + | +-------------------+ | | | | > + | | set_nwo | | | | | > + | +-------------------+ | | | | > + | | | | | | > + | +-------------+ | ping | | | | > + | | | | | | | | > + | | ping v v | | | | > + | | +-------------------+ | | | | > + | +-----------| safe_nwo | | | | | > + | +-------------------+ | | | | > + | | | | | | > + | | close | close | | | > + | v v | | | > + | +----------------------------------+ nowayout | | | > + | | | other_threads | | | > + | | closed_running_nwo | ----------------+ | | | > + | | | | | | | > + +---------------- | | <---------------+ | | | > + +----------------------------------+ | | | > + | nowayout ^ | | | > + +-----------------------------+ | | | > + | | | > + | | | > + +-------------------+ +--------+ | | | > + | | | |------+---+ | > + | started | start | opened | | | > + +---------------- | | <-------- | |>-----+-------+ > + | +-------------------+ +--------+ | ^ > + | | | | > + | | set_safe_timeout +-------------+-------+ > + | v | | > + | +-------------------+ | | > + | | | | | > + | | set | | | > + +----------+---------------> | | | | > + | | +-------------------+ | | > + | | | | | > + | | | ping | | > + | | v | | > + | | +-------------------+ ping | | > + | | | | -------+ | | > + | | +---- | safe | | | | > + | | | | | <------+ | | > + | | | +-------------------+ | | > + | | | | | | > + | | stop | | stop | | > + | | | v | | > + | | | +-------------------+ close | | > + | +-----------+---> | stopped |-------------+ | > + | | +-------------------+ | > + | +---+ | > + | | close | > + | v | > + | other_threads +----------------------------------------+ | > + | +--------------> | | | > + | | | closed_running | | > + | +--------------- | |--------------+ > + | +----------------------------------------+ > + | | ^ > + | open | | close > + | v | > + | set_safe_timeout +-------------------+ > + +-------------------------> | reopened | > + +-------------------+ So I must check the model matrix in patch18 and the above graph to check if the matrix have the not right value or something in this graph go wrong. open o init state ------> opened state: there lack a direction arrow. nwo o closed_running state -------> closed_running_nwo state: this is the state trasfer indicated from matrix, but the above graph is not consistent with it. In graph: nwo closed_running state -------> nwo state. I prefer the state closed_running_nwo. nwo other_thread o closed_running_nwo state ---------------> closed_running_nwo state. The graph above have a duplicate nwo event leading to state transfer. open closed_running_nwo state ---------------> started_nwo state. I think it should transfer to opened_nwo state. set_safe_timeout o reopened state ------------------> set state: the direction is from reopened state to set state. The graph above have two direction arrow. stop reopenped state ------------------> opened state. why stop event trigger it to opened state not stoped state. I think it should transfer to stoped state like the started state do. o started state do not have link to closed_running state by close event in graph. Like started_nwo state do, it should have a direct close event to link to closed_running state. Modified the above graph like this: + + #==================================# other_threads + H H ----------------+ + -----------> H init H | + H H <---------------+ + #==================================# + | | ^ + | | | close + | | +----------------------------------------------------+ + | | | + | | open | + | +------------------------------------------------------+ | + | | | + | nowayout | | + v | | + nowayout +-------------------+ | | + other_threads | | | | + +---------------- | nwo | | | + | | | | | + +---------------> | | <+ | | + +-------------------+ | | | + | | | | + | open | close | | + v | | | + +-------------------+ | | | + +---------------> | opened_nwo | -+ | | + | +-------------------+ | | + | | | | + | | start | | + | v | | + | +-------------------+ | | + | | started_nwo | -+ | | + | +-------------------+ | | | + | | | | | + | open | set_safe_timeout | | | + | v | | | + | +-------------------+ | | | + | | set_nwo | | | | + | +-------------------+ | | | + | | | | | + | +-------------+ | ping | | | + | | | | | | | + | | ping v v | | | + | | +-------------------+ | | | + | +-----------| safe_nwo | | | | + | +-------------------+ | | | + | | | | | + | | close | close | | + | v v | | + | +----------------------------------+ nowayout | | + | | | other_threads | | + | | closed_running_nwo | ----------------+ | | + | | | | | | + +---------------- | | <---------------+ | | + +----------------------------------+ | | + ^ | | + | nowayout | | + +--------------------------------------+ | | + | | | + +-------------------+ +--------+ | | | + | | start | |<----+---+ | + | started | <--------- | opened | | | + +---------------- | | ---------+ | |>----+-------+ + | +-------------------+ close | +--------+ | ^ + | | | | | + | | set_safe_timeout | +-----------+-------+ + | v | | | + | +-------------------+ | | | + | | | | | | + | | set | | | | + +----------+---------------> | | | | | + | | +-------------------+ | | | + | | | | | | + | | | ping | | | + | | v | | | + | | +-------------------+ ping | | | + | | | | -------+ | | | + | | +---- | safe | | | | | + | | | | | <------+ | | | + | | | +-------------------+ | | | + | | | | | | | + | | stop | | stop | | | + | | | v | | | + | | | +-------------------+ close | | | + | +-----------+---> | stopped |----------+----+ | + | | +-------------------+ | | + | +---+ ^ | | + | | close | | | + | | +------------------+------+ | | v v | | + | other_threads +----------------------------------------+ | | + | +--------------> | | | | + | | | closed_running | | | + | +--------------- | |----+---------+ + | +----------------------------------------+ | + | | ^ | + | open | | close |stop + | v | | + | set_safe_timeout +-------------------+ | + +-------------------------- | reopened |------------------+ + +-------------------+ Thanks > +It is important to note that the events sched_keep_alive and keep_alive > +are not allowed in the monitor (they are said to be blocked events). > +The execution of any blocked events leads the RV monitor to react. > + > +Additional options > +------------------ > + > +The RV monitor also has a set of options enabled via kernel command > +line/module options. They are: > + > + - watchdog_id: the device id to monitor (default 0); > + - dont_stop: once enabled, do not allow the RV monitor to be stopped (default off); > + - safe_timeout: define a maximum safe value that a user-space application can > + set as the watchdog timeout (default unlimited); > + - check_timeout: After every ping, check if the time left in the watchdog is less > + than or equal to the last timeout set for the watchdog. It only works for watchdog > + devices that provide the get_timeleft() function (default off); > -- > 2.35.1 >
On 7/7/22 14:41, Tao Zhou wrote: > So I must check the model matrix in patch18 and the above graph to > check if the matrix have the not right value or something in this > graph go wrong. It was my fault, the ASCII art generated by graph-easy was too wide, so I had to manually change it... and so I broke it. At least this validates that it is better to have it automatically generated :-) Thanks for reviewing it Tao, I do appreciate these kind of reviews!!!! -- Daniel
diff --git a/Documentation/trace/rv/watchdog-monitor.rst b/Documentation/trace/rv/watchdog-monitor.rst new file mode 100644 index 000000000000..2b142fb31572 --- /dev/null +++ b/Documentation/trace/rv/watchdog-monitor.rst @@ -0,0 +1,250 @@ +Watchdog monitor +---------------- + +The watchdog is an essential building block for the usage of Linux in +safety-critical systems because it allows the system to be monitored from +an external element - the watchdog hardware, acting as a safety-monitor. + +A user-space application controls the watchdog device via the watchdog +interface. This application, hereafter safety_app, enables the watchdog +and periodically pets the watchdog upon correct completion of the safety +related processing. + +If the safety_app, for any reason, stops pinging the watchdog, +the watchdog hardware can set the system in a fail-safe state. For +example, shutting the system down. + +Given the importance of the safety_app / watchdog hardware couple, +the interaction between these software pieces also needs some +sort of monitoring. In other words, "who monitors the monitor?" + +The safe watchdog (safe_wtd) RV monitor monitors the interaction between +the safety_app and the watchdog device, enforcing the correct sequence of +events that leads the system to a safe state. + +Furthermore, the safety_app can monitor the RV monitor by collecting the +events generated by the RV monitor itself via tracing interface. In this way, +closing the monitoring loop with the safety_app. + +A diagram of the components and their interactions is:: + + user-space: + +--------------------------------+ + | safety_app |-----------+ + +--------------------------------+ | + | ^ | + | Configure | Enable and | + | | check data | + ===================+====================+=============== | + kernel-space: | | | + v v | + +----------+ instr. +-------------+ | + | watchdog | ----------->| RV Monitor |----+ | + | device | +-------------+ | | + +----------+ | | + | | | + | | | + ================+====================================== | | + hardware: | | | + v | +-> Bring the system + +--------------------+ +----> to a safe state, + | watchdog hardware |---------------------------> e.g., halt. + +--------------------+ + +Sample safety_app +----------------- + +The user-space safety_app sample code in ``tools/verification/safety_app/`` +serves to illustrate the usage of the RV monitors for this use-case, as +well as the starting point to the development of a user-specific safety_app. + +Watchdog events +--------------- + +The RV monitor observes the watchdog by using instrumentation to +process the events generated by the interaction between the +safety_app and the watchdog device layer in kernel. + +The monitored events are: + + - watchdog:watchdog_open: open the watchdog device; + - watchdog:watchdog_close: close the watchdog device; + - watchdog:watchdog_start: start the watchdog; + - watchdog:watchdog_stop: stop the watchdog; + - watchdog:watchdog_set_timeout: set the watchdog timeout; + - watchdog:watchdog_ping: reprogram the watchdog with the previously set + timeout; + - watchdog:watchdog_nowayout: prevents the watchdog from stopping; + - watchdog:watchdog_set_keep_alive: set an intermediary ping to overcome + the limitation of a hardware watchdog maximum timeout being shorter than + the timeout set by the user-space tool; + - watchdog:watchdog_keep_alive: the execution of the function that runs the + intermediary keep alive ping; + +RV monitor events +----------------- + +The RV monitor monitors the relevant events as an outside observer, +interpreting all the components (the hardware; the watchdog device +interface; and the safety monitor) as an integrated component. + +The events selected for the monitor are: + + - other_threads: an event generated by any thread other than the + one that set nowayout or open the watchdog the last time. + - open: a thread opens the watchdog to manipulate it; + - close: a thread closes the watchdog; + - start: starts the watchdog countdown; + - stop: stops the watchdog; + - set_safe_timeout: configures the watchdog with a given timeout; + - ping: resets the watchdog countdown with the previously configured timeout; + - nowayout: prevents the watchdog to be stopped until the system's shutdown; + - sched_keep_alive: schedules a kernel worker to ping the watchdog if the + timeout is longer than the watchdog hardware can handle. + - keep_alive: executes the previously scheduled watchdog ping; + +Noting that the events that does not appear in the automata models are +considered blocked events, and their execution will always cause the +RV monitor to react to an unexpected event. + +RV monitor specification +------------------------ + +The monitor's goal is to assess a set of specifications that conducts the +system to a safe state. + +These specifications are: + + - 1: Once open, only one process manipulates the watchdog; + - 2: Following 1, the keep-alive mechanisms will not be used; + - 3: If required, nowayout will be set before opening the watchdog; + - 4: A safe timeout must be set; + - 5: At least one ping must be made before entering the safe/safe_nwo states + - 6: The RV monitor does not react if the watchdog is closed without stopping. + But the hardware watchdog is expected to react. + +Deterministic automata monitors +------------------------------- + +Following the specifications, a deterministic automata monitor +was developed. The monitor is modeled as Deterministic Automata model. + +The deterministic automata model for safe_wtd is:: + + #==================================# other_threads + H H ----------------+ + -----------> H init H | + H H <---------------+ + #==================================# + | | ^ + | | | close + | | +----------------------------------------------------+ + | | | + | | open | + | +------------------------------------------------------+ | + | | | + | nowayout | | + v | | + nowayout +-------------------+ | | + other_threads | | nowayout | | + +---------------- | nwo |<-------------------------------------+ | | + | | | | | | + +---------------> | | <+ | | | + +-------------------+ | | | | + | | | | | + | open | close | | | + v | | | | + +-------------------+ | | | | + | opened_nwo | -+ | | | + +-------------------+ | | | + | | | | + | start | | | + v | | | + +-------------------+ | | | + +---------------> | started_nwo | -+ | | | + | +-------------------+ | | | | + | | | | | | + | open | set_safe_timeout | | | | + | v | | | | + | +-------------------+ | | | | + | | set_nwo | | | | | + | +-------------------+ | | | | + | | | | | | + | +-------------+ | ping | | | | + | | | | | | | | + | | ping v v | | | | + | | +-------------------+ | | | | + | +-----------| safe_nwo | | | | | + | +-------------------+ | | | | + | | | | | | + | | close | close | | | + | v v | | | + | +----------------------------------+ nowayout | | | + | | | other_threads | | | + | | closed_running_nwo | ----------------+ | | | + | | | | | | | + +---------------- | | <---------------+ | | | + +----------------------------------+ | | | + | nowayout ^ | | | + +-----------------------------+ | | | + | | | + | | | + +-------------------+ +--------+ | | | + | | | |------+---+ | + | started | start | opened | | | + +---------------- | | <-------- | |>-----+-------+ + | +-------------------+ +--------+ | ^ + | | | | + | | set_safe_timeout +-------------+-------+ + | v | | + | +-------------------+ | | + | | | | | + | | set | | | + +----------+---------------> | | | | + | | +-------------------+ | | + | | | | | + | | | ping | | + | | v | | + | | +-------------------+ ping | | + | | | | -------+ | | + | | +---- | safe | | | | + | | | | | <------+ | | + | | | +-------------------+ | | + | | | | | | + | | stop | | stop | | + | | | v | | + | | | +-------------------+ close | | + | +-----------+---> | stopped |-------------+ | + | | +-------------------+ | + | +---+ | + | | close | + | v | + | other_threads +----------------------------------------+ | + | +--------------> | | | + | | | closed_running | | + | +--------------- | |--------------+ + | +----------------------------------------+ + | | ^ + | open | | close + | v | + | set_safe_timeout +-------------------+ + +-------------------------> | reopened | + +-------------------+ + +It is important to note that the events sched_keep_alive and keep_alive +are not allowed in the monitor (they are said to be blocked events). +The execution of any blocked events leads the RV monitor to react. + +Additional options +------------------ + +The RV monitor also has a set of options enabled via kernel command +line/module options. They are: + + - watchdog_id: the device id to monitor (default 0); + - dont_stop: once enabled, do not allow the RV monitor to be stopped (default off); + - safe_timeout: define a maximum safe value that a user-space application can + set as the watchdog timeout (default unlimited); + - check_timeout: After every ping, check if the time left in the watchdog is less + than or equal to the last timeout set for the watchdog. It only works for watchdog + devices that provide the get_timeleft() function (default off);
Adds documentation about the safe_wtd and safe_wtd_nwo RV monitors, and their usage via a safety application. Cc: Wim Van Sebroeck <wim@linux-watchdog.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Gabriele Paoloni <gpaoloni@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> --- Documentation/trace/rv/watchdog-monitor.rst | 250 ++++++++++++++++++++ 1 file changed, 250 insertions(+) create mode 100644 Documentation/trace/rv/watchdog-monitor.rst