Message ID | 20211018222157.12238-1-schmitzmic@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] block - ataflop.c: fix breakage introduced at blk-mq refactoring | expand |
On 10/18/21 4:21 PM, Michael Schmitz wrote: > Refactoring of the Atari floppy driver when converting to blk-mq > has broken the state machine in not-so-subtle ways: > > finish_fdc() must be called when operations on the floppy device > have completed. This is crucial in order to relase the ST-DMA > lock, which protects against concurrent access to the ST-DMA > controller by other drivers (some DMA related, most just related > to device register access - broken beyond compare, I know). > > When rewriting the drivers' old do_request() function, the fact > that finish_fdc() was called only when all queued requests had > completed appears to have been overlooked. Instead, the new > request function calls finish_fdc() immediately after the last > request has been queued. finish_fdc() executes a dummy seek after > most requests, and this overwrites the state machine's interrupt > hander that was set up to wait for completion of the read/write > request just prior. To make matters worse, finish_fdc() is called > before device interrupts are re-enabled, making certain that the > read/write interupt is missed. > > Shifting the finish_fdc() call into the read/write request completion > handler ensures the driver waits for the request to actually complete. Was going to ask if this driver was used by anyone, since it's taken 3 years for the breakage to be spotted... In all fairness, it was pretty horribly broken before the change too (like waiting in request_fn, under a lock). So I'm curious, are you actively using it, or was it just an exercise in curiosity? > Testing this change, I've only ever seen single sector requests with the > 'last' flag set. If there is a way to send requests to the driver > without that flag set, I'd appreciate a hint. As it now stands, > the driver won't release the ST-DMA lock on requests that don't have > this flag set, but won't accept further requests because the attempt > to acquire the already-held lock once more will fail. 'last' is set if it's the last of a sequence of ->queue_rq() calls. If you just do sync IO, then last is always set, as there is no sequence. It's not hard to generate sequences, but on a floppy with basically no queue depth the most you'd ever get is 2. You could try and set: /sys/block/<dev>/queue/max_sectors_kb to 4 for example, and then do something that generates a larger than 4k write or read. Ideally that should give you more than 1.
On Mon, 18 Oct 2021, Jens Axboe wrote: > Was going to ask if this driver was used by anyone, since it's taken 3 > years for the breakage to be spotted... A lack of bug reports never implied a lack of users (or potential users). That falacy is really getting tiresome. It is much more difficult to report regressions than it is to use a workaround (i.e. boot a known good kernel). And I have plenty of sympathy for end-users who may assume that the people and corporations who create the breakage will take responsibility for fixing it. Do maintainers really expect innocent end users to report and bisect regressions, and also test a series of potential fixes until one of them is finally found to both work and pass code review?
On 10/18/21 4:51 PM, Finn Thain wrote: > > On Mon, 18 Oct 2021, Jens Axboe wrote: > >> Was going to ask if this driver was used by anyone, since it's taken 3 >> years for the breakage to be spotted... > > A lack of bug reports never implied a lack of users (or potential users). > That falacy is really getting tiresome. If it's utterly broken, I'd say yes, it very much does imply that really nobody is using it. > It is much more difficult to report regressions than it is to use a > workaround (i.e. boot a known good kernel). And I have plenty of sympathy > for end-users who may assume that the people and corporations who create > the breakage will take responsibility for fixing it. We're talking about a floppy driver here, and one for ATARI no less. It's not much of a leap of faith to assume that a) those users are more savvy than the average computer user, as they have to compile their own kernels anyway. b) that there are essentially zero of them left. The number is clearly different from zero, but I doubt by much. Hence it would stand to reason that if someone was indeed in the group of ATARI floppy users that they would know how to report a bug. Not to mention that it was pretty broken to begin with, so can't have been used much before either. The reason I ask is always to have an eye out for what drivers can be eventually removed. The older drivers, and particurly the floppy ones, are quite horrible and would need a lot of work to bring up to modern standards in terms of how they are written. And if nobody is really using them, then we'd all be better off cutting some of that dead baggage. > Do maintainers really expect innocent end users to report and bisect > regressions, and also test a series of potential fixes until one of them > is finally found to both work and pass code review? If someone reports a bug to me, the most basic is usually "It worked in this version, it's broken in this one". Then you take it from there, depending on the abilities of the user. I'd only ask someone to bisect an issue if it's really puzzling and the user is capable and willing. But it doesn't take much to send a simple email saying that something used to work and now it's broken.
On Mon, 18 Oct 2021, Jens Axboe wrote: > > It is much more difficult to report regressions than it is to use a > > workaround (i.e. boot a known good kernel). And I have plenty of > > sympathy for end-users who may assume that the people and corporations > > who create the breakage will take responsibility for fixing it. > > We're talking about a floppy driver here, and one for ATARI no less. > It's not much of a leap of faith to assume that > > a) those users are more savvy than the average computer user, as they > have to compile their own kernels anyway. > > b) that there are essentially zero of them left. The number is clearly > different from zero, but I doubt by much. > Well, that assumption is as dangerous as any. The floppy interface is still important even if most of the old mechanisms have been replaced. http://hxc2001.free.fr/floppy_drive_emulator/ https://amigastore.eu/en/220-sd-floppy-emulator-rev-c.html https://www.bigmessowires.com/floppy-emu/ > Hence it would stand to reason that if someone was indeed in the group > of ATARI floppy users that they would know how to report a bug. Yes, it would if the premise was valid. But the premise is just a flawed assumption.
On 10/18/21 5:17 PM, Finn Thain wrote: > On Mon, 18 Oct 2021, Jens Axboe wrote: > >>> It is much more difficult to report regressions than it is to use a >>> workaround (i.e. boot a known good kernel). And I have plenty of >>> sympathy for end-users who may assume that the people and corporations >>> who create the breakage will take responsibility for fixing it. >> >> We're talking about a floppy driver here, and one for ATARI no less. >> It's not much of a leap of faith to assume that >> >> a) those users are more savvy than the average computer user, as they >> have to compile their own kernels anyway. >> >> b) that there are essentially zero of them left. The number is clearly >> different from zero, but I doubt by much. >> > > Well, that assumption is as dangerous as any. The floppy interface is > still important even if most of the old mechanisms have been replaced. > > http://hxc2001.free.fr/floppy_drive_emulator/ > https://amigastore.eu/en/220-sd-floppy-emulator-rev-c.html > https://www.bigmessowires.com/floppy-emu/ > >> Hence it would stand to reason that if someone was indeed in the group >> of ATARI floppy users that they would know how to report a bug. > > Yes, it would if the premise was valid. But the premise is just a flawed > assumption. Oh please, can we skip the empty words, this is tiresome and unproductive. Since you apparently have a much better grasp on this than I do, answer me this: 1) How many users of ataflop are there? 2) How big of a subset of that group are capable of figuring out where to send a bug report? By your reasoning, any bug would go unreported for years, no matter how big the user group is. That is patently false. It's most commonly a combination of how hard it is to hit, and how many can potentially hit it. Yes, some people will work around a bug, but others will not. Hence a subset of people that hit it will report it. Decades of bug reports have proven this to be true on my end. Nobody has reported the ataflop issue in 3 years. Either these people never upgrade (which may be true), or none of them are using ataflop. It's as simple as that.
Hi Jens, On 19/10/21 11:30, Jens Axboe wrote: > Was going to ask if this driver was used by anyone, since it's taken 3 Can't honestly say - I'm not following any other user forum than linux-m68k (and that's not really a user forum either). > years for the breakage to be spotted... In all fairness, it was pretty > horribly broken before the change too (like waiting in request_fn, under > a lock). In all fairness, it was a pretty broken design, but it did at least work. I concede that it was unmaintainable in its old form, and still largely is, just surprised that I didn't see a call for testing on linux-m68k, considering the committer realized it probably wouldn't work. > So I'm curious, are you actively using it, or was it just an exercise in > curiosity? I've used it quite a bit in the past, but not for many years. For legacy hardware, floppies are often the only way to get data on or off the device, and I consider this driver an important fallback option should my network adapter (which is a pretty horrible kludge to use an old ISA NE2000 card on the ROM cartridge port) fail. But then, any use of this legacy hardware is an exercise in curiosity mostly. > >> Testing this change, I've only ever seen single sector requests with the >> 'last' flag set. If there is a way to send requests to the driver >> without that flag set, I'd appreciate a hint. As it now stands, >> the driver won't release the ST-DMA lock on requests that don't have >> this flag set, but won't accept further requests because the attempt >> to acquire the already-held lock once more will fail. > > 'last' is set if it's the last of a sequence of ->queue_rq() calls. If > you just do sync IO, then last is always set, as there is no sequence. > It's not hard to generate sequences, but on a floppy with basically no > queue depth the most you'd ever get is 2. You could try and set: > > /sys/block/<dev>/queue/max_sectors_kb > > to 4 for example, and then do something that generates a larger than 4k > write or read. Ideally that should give you more than 1. Thanks, tried that - that does indeed cause multiple requests queued to the driver (which rejects them promptly). Now fails because ataflop_commit_rqs() unconditionally calls finish_fdc() right after the first request started processing- and promptly wipes it again. What is the purpose of .commit_rqs? The PC legacy floppy driver doesn't use it ... Cheers, Michael
On 10/18/21 5:35 PM, Michael Schmitz wrote: > Hi Jens, > > On 19/10/21 11:30, Jens Axboe wrote: >> Was going to ask if this driver was used by anyone, since it's taken 3 > > Can't honestly say - I'm not following any other user forum than > linux-m68k (and that's not really a user forum either). > >> years for the breakage to be spotted... In all fairness, it was pretty >> horribly broken before the change too (like waiting in request_fn, under >> a lock). > > In all fairness, it was a pretty broken design, but it did at least > work. I concede that it was unmaintainable in its old form, and still > largely is, just surprised that I didn't see a call for testing on > linux-m68k, considering the committer realized it probably wouldn't work. I don't remember the details on that front, it's usually very difficult to get people to test this kind of change, unfortunately. But thanks for tackling it now! >> So I'm curious, are you actively using it, or was it just an exercise in >> curiosity? > > I've used it quite a bit in the past, but not for many years. For legacy > hardware, floppies are often the only way to get data on or off the > device, and I consider this driver an important fallback option should > my network adapter (which is a pretty horrible kludge to use an old ISA > NE2000 card on the ROM cartridge port) fail. > > But then, any use of this legacy hardware is an exercise in curiosity > mostly. OK, that's good enough then. Was mostly just curious if was actually being used. >>> Testing this change, I've only ever seen single sector requests with the >>> 'last' flag set. If there is a way to send requests to the driver >>> without that flag set, I'd appreciate a hint. As it now stands, >>> the driver won't release the ST-DMA lock on requests that don't have >>> this flag set, but won't accept further requests because the attempt >>> to acquire the already-held lock once more will fail. >> >> 'last' is set if it's the last of a sequence of ->queue_rq() calls. If >> you just do sync IO, then last is always set, as there is no sequence. >> It's not hard to generate sequences, but on a floppy with basically no >> queue depth the most you'd ever get is 2. You could try and set: >> >> /sys/block/<dev>/queue/max_sectors_kb >> >> to 4 for example, and then do something that generates a larger than 4k >> write or read. Ideally that should give you more than 1. > > Thanks, tried that - that does indeed cause multiple requests queued to > the driver (which rejects them promptly). > > Now fails because ataflop_commit_rqs() unconditionally calls > finish_fdc() right after the first request started processing- and > promptly wipes it again. > > What is the purpose of .commit_rqs? The PC legacy floppy driver doesn't > use it ... You only need to care about bd->last if you have something in the driver that can make it cheaper to commit more than one request. An example is a driver that fills in requests, and then has an operation to ring the submission doorbell to flush them out. The latter is what ->commit_rqs is for. For a floppy driver, just ignore bd->last and don't implement commit_rqs, I don't think we're squeezing a lot of extra efficiency out of it through that! Think many hundreds of thousands of IOPS or millions of IOPS, not a handful of IOPS or less.
On Tue, Oct 19, 2021 at 12:35:27PM +1300, Michael Schmitz wrote: > Hi Jens, > > On 19/10/21 11:30, Jens Axboe wrote: > > Was going to ask if this driver was used by anyone, since it's taken 3 > > Can't honestly say - I'm not following any other user forum than linux-m68k > (and that's not really a user forum either). > > > years for the breakage to be spotted... In all fairness, it was pretty > > horribly broken before the change too (like waiting in request_fn, under > > a lock). > > In all fairness, it was a pretty broken design, but it did at least work. I > concede that it was unmaintainable in its old form, and still largely is, > just surprised that I didn't see a call for testing on linux-m68k, > considering the committer realized it probably wouldn't work. I recall cc'ing some people that I hoped could point me at the right place to test this, but I don't think I was aware of the linux-m68k mailing list.
On Mon, 18 Oct 2021, Jens Axboe wrote: > > Oh please, can we skip the empty words, this is tiresome and > unproductive. Since you apparently have a much better grasp on this than > I do, answer me this: > > 1) How many users of ataflop are there? > > 2) How big of a subset of that group are capable of figuring out where > to send a bug report? > Both good questions. Here are some more. 3) How many users is sufficient to justify the cost of keeping ataflop around? 4) How long is the user count allowed to remain below that threshold, before the code is removed? > By your reasoning, any bug would go unreported for years, no matter how > big the user group is. That is patently false. No, those are your words, not mine. > It's most commonly a combination of how hard it is to hit, and how many > can potentially hit it. Yes, some people will work around a bug, but > others will not. Hence a subset of people that hit it will report it. > Decades of bug reports have proven this to be true on my end. > I agree that a bug report count can be a proxy for a user count, but there is always a confidence level attached to such statistical reasoning, which can and should be quantified. > Nobody has reported the ataflop issue in 3 years. Either these people > never upgrade (which may be true), or none of them are using ataflop. > It's as simple as that. > It is when you over-simplify. The mere fact that Michael is working on this driver publicly will probably increase its user base. I think you and I both know that code with non-zero user count regularly gets removed. I think the main criterion for keeping code around has always been the expense. So I help with API modernization for the drivers I'm responsible for, to make them cheaper to keep around. Other people concerned about the cost of keeping code in the tree should look at drivers which only work on devices with vendor kernels. And they should consider the size of those drivers. When kernel.org has dropped all the code in that category, then sure, let's worry about a few tiny old legacy drivers.
On 10/18/21 6:14 PM, Finn Thain wrote: > On Mon, 18 Oct 2021, Jens Axboe wrote: > >> >> Oh please, can we skip the empty words, this is tiresome and >> unproductive. Since you apparently have a much better grasp on this than >> I do, answer me this: >> >> 1) How many users of ataflop are there? >> >> 2) How big of a subset of that group are capable of figuring out where >> to send a bug report? >> > > Both good questions. Here are some more. > > 3) How many users is sufficient to justify the cost of keeping ataflop > around? > > 4) How long is the user count allowed to remain below that threshold, > before the code is removed? I'm not interested in bot conversations. EOD.
Hi Jens, On Tue, Oct 19, 2021 at 12:40 PM Jens Axboe <axboe@kernel.dk> wrote: > >> 'last' is set if it's the last of a sequence of ->queue_rq() calls. If > >> you just do sync IO, then last is always set, as there is no sequence. > >> It's not hard to generate sequences, but on a floppy with basically no > >> queue depth the most you'd ever get is 2. You could try and set: > >> > >> /sys/block/<dev>/queue/max_sectors_kb > >> > >> to 4 for example, and then do something that generates a larger than 4k > >> write or read. Ideally that should give you more than 1. > > > > Thanks, tried that - that does indeed cause multiple requests queued to > > the driver (which rejects them promptly). > > > > Now fails because ataflop_commit_rqs() unconditionally calls > > finish_fdc() right after the first request started processing- and > > promptly wipes it again. > > > > What is the purpose of .commit_rqs? The PC legacy floppy driver doesn't > > use it ... > > You only need to care about bd->last if you have something in the driver > that can make it cheaper to commit more than one request. An example is > a driver that fills in requests, and then has an operation to ring the > submission doorbell to flush them out. The latter is what ->commit_rqs > is for. OK, that's indeed a no-op for our floppy driver, which can queue exactly one request. > For a floppy driver, just ignore bd->last and don't implement > commit_rqs, I don't think we're squeezing a lot of extra efficiency out > of it through that! Think many hundreds of thousands of IOPS or millions > of IOPS, not a handful of IOPS or less. I'm not averse to using bd->last to close down only after the last request in a sequence if it can be done safely (i.e. the requests that had been rejected are then promptly requeued). But complexity is the enemy of maintainability, so the nice and easy fix should be enough. I'll respin and send another version shortly. Cheers, Michael
On 10/18/21 6:42 PM, Michael Schmitz wrote: > Hi Jens, > > On Tue, Oct 19, 2021 at 12:40 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> 'last' is set if it's the last of a sequence of ->queue_rq() calls. If >>>> you just do sync IO, then last is always set, as there is no sequence. >>>> It's not hard to generate sequences, but on a floppy with basically no >>>> queue depth the most you'd ever get is 2. You could try and set: >>>> >>>> /sys/block/<dev>/queue/max_sectors_kb >>>> >>>> to 4 for example, and then do something that generates a larger than 4k >>>> write or read. Ideally that should give you more than 1. >>> >>> Thanks, tried that - that does indeed cause multiple requests queued to >>> the driver (which rejects them promptly). >>> >>> Now fails because ataflop_commit_rqs() unconditionally calls >>> finish_fdc() right after the first request started processing- and >>> promptly wipes it again. >>> >>> What is the purpose of .commit_rqs? The PC legacy floppy driver doesn't >>> use it ... >> >> You only need to care about bd->last if you have something in the driver >> that can make it cheaper to commit more than one request. An example is >> a driver that fills in requests, and then has an operation to ring the >> submission doorbell to flush them out. The latter is what ->commit_rqs >> is for. > > OK, that's indeed a no-op for our floppy driver, which can queue > exactly one request. Right, and the only reason the depth is set to 2 is to allow one for merging purposes. >> For a floppy driver, just ignore bd->last and don't implement >> commit_rqs, I don't think we're squeezing a lot of extra efficiency out >> of it through that! Think many hundreds of thousands of IOPS or millions >> of IOPS, not a handful of IOPS or less. > > I'm not averse to using bd->last to close down only after the last > request in a sequence if it can be done safely (i.e. the requests that > had been rejected are then promptly requeued). But complexity is the > enemy of maintainability, so the nice and easy fix should be enough. With just 2 requests, any sequence is going to be pretty limited :-). My recommendation would be to just ignore bd->last and treat any request as a standalone unit. Should make for easier code too, and you won't have two different cases to handle. > I'll respin and send another version shortly. Great, thanks.
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c index 96e2abe34a72..95469b4a8f78 100644 --- a/drivers/block/ataflop.c +++ b/drivers/block/ataflop.c @@ -653,8 +653,10 @@ static inline void copy_buffer(void *from, void *to) *p2++ = *p1++; } +/* finish_fdc() handling */ - +static void (*fdc_finish_action)( void ) = NULL; + /* General Interrupt Handling */ @@ -1228,6 +1230,12 @@ static void fd_rwsec_done1(int status) } else { /* all sectors finished */ + void (*handler)( void ); + + handler = xchg(&fdc_finish_action, NULL); + if (handler) + handler(); + fd_end_request_cur(BLK_STS_OK); } return; @@ -1391,6 +1399,11 @@ static void finish_fdc_done( int dummy ) DPRINT(("finish_fdc() finished\n")); } +static void queue_finish_fdc( void ) +{ + fdc_finish_action = finish_fdc; +} + /* The detection of disk changes is a dark chapter in Atari history :-( * Because the "Drive ready" signal isn't present in the Atari * hardware, one has to rely on the "Write Protect". This works fine, @@ -1491,6 +1504,8 @@ static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, int drive = floppy - unit; int type = floppy->type; + DPRINT(("Queue request: drive %d type %d\n", drive, type)); + spin_lock_irq(&ataflop_lock); if (fd_request) { spin_unlock_irq(&ataflop_lock); @@ -1551,7 +1566,7 @@ static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, do_fd_action( drive ); if (bd->last) - finish_fdc(); + queue_finish_fdc(); atari_enable_irq( IRQ_MFP_FDC ); out:
Refactoring of the Atari floppy driver when converting to blk-mq has broken the state machine in not-so-subtle ways: finish_fdc() must be called when operations on the floppy device have completed. This is crucial in order to relase the ST-DMA lock, which protects against concurrent access to the ST-DMA controller by other drivers (some DMA related, most just related to device register access - broken beyond compare, I know). When rewriting the drivers' old do_request() function, the fact that finish_fdc() was called only when all queued requests had completed appears to have been overlooked. Instead, the new request function calls finish_fdc() immediately after the last request has been queued. finish_fdc() executes a dummy seek after most requests, and this overwrites the state machine's interrupt hander that was set up to wait for completion of the read/write request just prior. To make matters worse, finish_fdc() is called before device interrupts are re-enabled, making certain that the read/write interupt is missed. Shifting the finish_fdc() call into the read/write request completion handler ensures the driver waits for the request to actually complete. Testing this change, I've only ever seen single sector requests with the 'last' flag set. If there is a way to send requests to the driver without that flag set, I'd appreciate a hint. As it now stands, the driver won't release the ST-DMA lock on requests that don't have this flag set, but won't accept further requests because the attempt to acquire the already-held lock once more will fail. Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Fixes: 6ec3938cff95f (ataflop: convert to blk-mq) CC: linux-block@vger.kernel.org CC: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> --- drivers/block/ataflop.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)