Message ID | 20200213125449.14226-1-jgross@suse.com (mailing list archive) |
---|---|
Series | xen: don't let keyhandlers block indefinitely on locks |

On 13/02/2020 12:54, Juergen Gross wrote:
> Keyhandlers dumping hypervisor information to the console often need
> to take locks while accessing data. In order to not block in case of
> system inconsistencies it is convenient to use trylock variants when
> obtaining the locks. On the other hand a busy system might easily
> encounter held locks, so this patch series is adding special trylock
> variants with a timeout used by keyhandlers.

This is a backwards step.

Keyhandlers are for debugging purposes. When debugging it is far more
important to get the requested data, than almost anything else.

The system will cope with a multi-second outage occurring approximately
never. A person debugging who can't get the data has no chance of
fixing whatever problem they are looking for.

This series seems to be breaking the one critical usecase for
keyhandlers, to fix what - not let debugging get in the way of the
smooth running of the system? A system in need of debugging in the
first place has bigger problems than needing to run smoothly.

The only thing which should happen to improve system stability is for
keyhandlers to disable the system watchdog while they are running, in
case they happen to run for seconds of wallclock time. This is an issue
which isn't addressed by the series, because once a keyhandler does get
a lock, it keeps it until it is done.

~Andrew

On 13.02.20 19:38, Andrew Cooper wrote:
> On 13/02/2020 12:54, Juergen Gross wrote:
>> Keyhandlers dumping hypervisor information to the console often need
>> to take locks while accessing data. In order to not block in case of
>> system inconsistencies it is convenient to use trylock variants when
>> obtaining the locks. On the other hand a busy system might easily
>> encounter held locks, so this patch series is adding special trylock
>> variants with a timeout used by keyhandlers.
>
> This is a backwards step.
>
> Keyhandlers are for debugging purposes. When debugging it is far more
> important to get the requested data, than almost anything else.

Right.

> The system will cope with a multi-second outage occurring approximately
> never. A person debugging who can't get the data has no chance of
> fixing whatever problem they are looking for.

Right.

> This series seems to be breaking the one critical usecase for
> keyhandlers, to fix what - not let debugging get in the way of the
> smooth running of the system? A system in need of debugging in the
> first place has bigger problems than needing to run smoothly.

Okay, this warrants a longer default timeout.

A keyhandler blocking on a lock will produce exactly no further data,
and it will probably block other keyhandlers, too, due to hogging at
least one cpu completely.

With a longer lock timeout (1 second?) there is a much higher chance
that the keyhandler will finish its job producing more data than without
any timeout.

BTW, during development of my core scheduling series I was hit by that
problem multiple times. With the lock timeout I'd have spared dozens of
reboots.

> The only thing which should happen to improve system stability is for
> keyhandlers to disable the system watchdog while they are running, in
> case they happen to run for seconds of wallclock time. This is an issue
> which isn't addressed by the series, because once a keyhandler does get
> a lock, it keeps it until it is done.

Right, will add disabling the watchdog during keyhandler action.

Juergen

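For readers following the thread, the idea of a "trylock with a timeout" discussed above can be sketched roughly as below. This is only an illustration in portable C11, not code from the series: `demo_lock_t`, `demo_trylock()`, `demo_trylock_timeout()` and `demo_now_ns()` are hypothetical names, and the real patches would use Xen's own spinlock and time primitives instead of `atomic_flag` and `clock_gettime()`.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical lock type for illustration only; not Xen's spinlock_t. */
typedef struct {
    atomic_flag held;
} demo_lock_t;

#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Monotonic time in nanoseconds (assumes a POSIX clock is available). */
int64_t demo_now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* One non-blocking attempt: true if the lock was free and is now ours. */
bool demo_trylock(demo_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void demo_unlock(demo_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Keep retrying the trylock until it succeeds or timeout_ns has elapsed.
 * At least one attempt is always made, even with a timeout of 0.
 * Returns true with the lock held, or false if the caller must give up.
 */
bool demo_trylock_timeout(demo_lock_t *l, int64_t timeout_ns)
{
    int64_t deadline = demo_now_ns() + timeout_ns;

    do {
        if (demo_trylock(l))
            return true;
    } while (demo_now_ns() < deadline);

    return false;
}
```

A dump routine written this way would skip the affected data and print a note when `demo_trylock_timeout()` returns false, rather than spinning forever. The timeout value is the trade-off debated in this thread: too short and the requested data is lost on a busy system, too long and it is effectively a blocking lock again.
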
On 13.02.2020 19:38, Andrew Cooper wrote:
> On 13/02/2020 12:54, Juergen Gross wrote:
>> Keyhandlers dumping hypervisor information to the console often need
>> to take locks while accessing data. In order to not block in case of
>> system inconsistencies it is convenient to use trylock variants when
>> obtaining the locks. On the other hand a busy system might easily
>> encounter held locks, so this patch series is adding special trylock
>> variants with a timeout used by keyhandlers.
>
> This is a backwards step.
>
> Keyhandlers are for debugging purposes. When debugging it is far more
> important to get the requested data, than almost anything else.
>
> The system will cope with a multi-second outage occurring approximately
> never. A person debugging who can't get the data has no chance of
> fixing whatever problem they are looking for.
>
> This series seems to be breaking the one critical usecase for
> keyhandlers, to fix what - not let debugging get in the way of the
> smooth running of the system? A system in need of debugging in the
> first place has bigger problems than needing to run smoothly.

I certainly accept what you say further up, but I don't think this
last statement is universally true. There may be a single guest in
trouble, which - to find out about its state - some debugging keys
may want issuing. Disturbing the host and all other guests for this
is not a good idea imo.

Jan

Hi Jan,

On 14/02/2020 09:37, Jan Beulich wrote:
> On 13.02.2020 19:38, Andrew Cooper wrote:
>> On 13/02/2020 12:54, Juergen Gross wrote:
>>> Keyhandlers dumping hypervisor information to the console often need
>>> to take locks while accessing data. In order to not block in case of
>>> system inconsistencies it is convenient to use trylock variants when
>>> obtaining the locks. On the other hand a busy system might easily
>>> encounter held locks, so this patch series is adding special trylock
>>> variants with a timeout used by keyhandlers.
>>
>> This is a backwards step.
>>
>> Keyhandlers are for debugging purposes. When debugging it is far more
>> important to get the requested data, than almost anything else.
>>
>> The system will cope with a multi-second outage occurring approximately
>> never. A person debugging who can't get the data has no chance of
>> fixing whatever problem they are looking for.
>>
>> This series seems to be breaking the one critical usecase for
>> keyhandlers, to fix what - not let debugging get in the way of the
>> smooth running of the system? A system in need of debugging in the
>> first place has bigger problems than needing to run smoothly.
>
> I certainly accept what you say further up, but I don't think this
> last statement is universally true. There may be a single guest in
> trouble, which - to find out about its state - some debugging keys
> may want issuing. Disturbing the host and all other guests for this
> is not a good idea imo.

This seems to suggest that you only want information for a single guest.
Therefore using debugging keys was already a bad idea because it will
disturb all the other guests.

For your setup, it might be worth considering to extend xenctx or
introduce a way to dump information for a specific domain.

Cheers,