diff mbox

[v3,0/4] g_NCR5380: PDMA fixes and cleanup

Message ID 201706270828.40336.linux@rainbow-software.org (mailing list archive)
State Changes Requested, archived
Headers show

Commit Message

Ondrej Zary June 27, 2017, 6:28 a.m. UTC
On Monday 26 June 2017, Ondrej Zary wrote:
> On Monday 26 June 2017 09:30:33 Finn Thain wrote:
> > Ondrej, would you please test this new series?
> >
> > Changed since v1:
> > - PDMA transfer residual is calculated earlier.
> > - End of DMA flag check is now polled (if there is any residual).
> >
> > Changed since v2:
> > - Bail out of transfer loops when Gated IRQ gets asserted.
> > - Make udelay conditional on board type.
> > - Drop sg_tablesize patch due to performance regression.
> >
> >
> > Finn Thain (1):
> >   g_NCR5380: Cleanup comments and whitespace
> >
> > Ondrej Zary (3):
> >   g_NCR5380: Fix PDMA transfer size
> >   g_NCR5380: End PDMA transfer correctly on target disconnection
> >   g_NCR5380: Re-work PDMA loops
> >
> >  drivers/scsi/g_NCR5380.c | 239
> > +++++++++++++++++++++++------------------------ 1 file changed, 116
> > insertions(+), 123 deletions(-)

BTW. I've probably found the DTC write corruption.
Added the following check (13 is host buffer index register) - and it triggers 
sometimes: the value is 1 instead of 0. As we use only 16-bit writes, I don't 
see how the value could ever be odd. Looks like a bug in the chip.
The index register corrupts during the transfer, not after IRQ or timeout. The 
same check at beginning of pwrite() did not trigger.

The index register is not writable so we must(?) reset the PDMA engine to 
recover. However, this quick attempt to fix does not work, maybe we should 
reload the block count and continue?

                                                        src + start, 64);

Comments

Michael Schmitz June 27, 2017, 8:30 a.m. UTC | #1
Ondrej,

could this be a partial write (target did not transfer the last byte)?

One would suppose the chip posts a phase mismatch in that case ...

Cheers,

	Michael


Am 27.06.2017 um 18:28 schrieb Ondrej Zary:
> On Monday 26 June 2017, Ondrej Zary wrote:
>> On Monday 26 June 2017 09:30:33 Finn Thain wrote:
>>> Ondrej, would you please test this new series?
>>>
>>> Changed since v1:
>>> - PDMA transfer residual is calculated earlier.
>>> - End of DMA flag check is now polled (if there is any residual).
>>>
>>> Changed since v2:
>>> - Bail out of transfer loops when Gated IRQ gets asserted.
>>> - Make udelay conditional on board type.
>>> - Drop sg_tablesize patch due to performance regression.
>>>
>>>
>>> Finn Thain (1):
>>>   g_NCR5380: Cleanup comments and whitespace
>>>
>>> Ondrej Zary (3):
>>>   g_NCR5380: Fix PDMA transfer size
>>>   g_NCR5380: End PDMA transfer correctly on target disconnection
>>>   g_NCR5380: Re-work PDMA loops
>>>
>>>  drivers/scsi/g_NCR5380.c | 239
>>> +++++++++++++++++++++++------------------------ 1 file changed, 116
>>> insertions(+), 123 deletions(-)
> 
> BTW. I've probably found the DTC write corruption.
> Added the following check (13 is host buffer index register) - and it triggers 
> sometimes: the value is 1 instead of 0. As we use only 16-bit writes, I don't 
> see how the value could ever be odd. Looks like a bug in the chip.
> The index register corrupts during the transfer, not after IRQ or timeout. The 
> same check at beginning of pwrite() did not trigger.
> 
> The index register is not writable so we must(?) reset the PDMA engine to 
> recover. However, this quick attempt to fix does not work, maybe we should 
> reload the block count and continue?
> 
> --- a/drivers/scsi/g_NCR5380.c
> +++ b/drivers/scsi/g_NCR5380.c
> @@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct 
> NCR5380_hostdata *hostdata,
>                                 goto out_wait;
>                         }
>                 }
> -
> +               idx = NCR5380_read(13);
> +               if (idx != 0) {
> +                       printk("host idx=%d, start=%d\n", idx, start);
> +                       NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
> +                       NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
> +                       goto out_wait;
> +               }
>                 if (hostdata->io_port && hostdata->io_width == 2)
>                         outsw(hostdata->io_port + hostdata->c400_host_buf,
>                                                         src + start, 64);
> 
>
Finn Thain June 27, 2017, 12:42 p.m. UTC | #2
On Tue, 27 Jun 2017, Ondrej Zary wrote:

> BTW. I've probably found the DTC write corruption. Added the following 
> check (13 is host buffer index register) -

That register is not mentioned in my 53c400 datasheet.

> and it triggers sometimes: the value is 1 instead of 0. As we use only 
> 16-bit writes, I don't see how the value could ever be odd. Looks like a 
> bug in the chip. The index register corrupts during the transfer, not 
> after IRQ or timeout. The same check at beginning of pwrite() did not 
> trigger.
> 

Are you reading this register at the right moment? Have you tried waiting 
for it to reach zero, as in,

	if (NCR5380_poll_politely(hostdata, 13, 0xff, 0, HZ / 64) < 0)
		/* printk, reset etc */;

Even if this is a reliable way to detect a short transfer, it would be 
nice to know the root cause. But I'm being unrealistic: the DTC436 vendor 
never responded to my requests for technical documentation.

> The index register is not writable so we must(?) reset the PDMA engine 
> to recover. However, this quick attempt to fix does not work, maybe we 
> should reload the block count and continue?
> 

I don't know if it is possible to recover. If the last byte never reached 
the scsi bus, then once you reset the 53c400 core, you need the driver to 
perform a single-byte PIO transfer after the short PDMA transfer. This 
would require that you set the residual appropriately (though in my 
experience that may not be sufficient).

It may be better to simply limit the transfer to 512 bytes instead of 
attempting to recover based on an undocumented (?) register, etc. Seems 
like a bit of a hack.

> --- a/drivers/scsi/g_NCR5380.c
> +++ b/drivers/scsi/g_NCR5380.c
> @@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct 
> NCR5380_hostdata *hostdata,
>                                 goto out_wait;
>                         }
>                 }
> -
> +               idx = NCR5380_read(13);
> +               if (idx != 0) {
> +                       printk("host idx=%d, start=%d\n", idx, start);
> +                       NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
> +                       NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
> +                       goto out_wait;
> +               }
>                 if (hostdata->io_port && hostdata->io_width == 2)
>                         outsw(hostdata->io_port + hostdata->c400_host_buf,
>                                                         src + start, 64);
> 

I find it hard to reason about this code. For example, out_wait is to be 
removed. Let's get the preceding patches working and signed-off. Please go 
ahead and use a 512 B transfer for DTC436 testing if that will help get 
this patch series over the line.

--
Finn Thain June 27, 2017, 12:54 p.m. UTC | #3
On Tue, 27 Jun 2017, Michael Schmitz wrote:

> Ondrej,
> 
> could this be a partial write (target did not transfer the last byte)?
> 

We do wait for TCR_LAST_BYTE_SENT, but only when there is no residual.

Perhaps we should wait for TCR_LAST_BYTE_SENT whenever the 53c400 asserts 
/EOP. That is, whenever BASR_END_DMA_TRANSFER is set.
Ondrej Zary June 27, 2017, 4:06 p.m. UTC | #4
On Tuesday 27 June 2017 14:42:29 Finn Thain wrote:
> On Tue, 27 Jun 2017, Ondrej Zary wrote:
> > BTW. I've probably found the DTC write corruption. Added the following
> > check (13 is host buffer index register) -
>
> That register is not mentioned in my 53c400 datasheet.

Yes, it's not there. But we don't have 53C400A and DTC436 datasheets (this 
register works on both).

> > and it triggers sometimes: the value is 1 instead of 0. As we use only
> > 16-bit writes, I don't see how the value could ever be odd. Looks like a
> > bug in the chip. The index register corrupts during the transfer, not
> > after IRQ or timeout. The same check at beginning of pwrite() did not
> > trigger.
>
> Are you reading this register at the right moment? Have you tried waiting
> for it to reach zero, as in,
>
> 	if (NCR5380_poll_politely(hostdata, 13, 0xff, 0, HZ / 64) < 0)
> 		/* printk, reset etc */;

I have not but will try (expecting that it will not change by itself).

> Even if this is a reliable way to detect a short transfer, it would be
> nice to know the root cause. But I'm being unrealistic: the DTC436 vendor
> never responded to my requests for technical documentation.

According to the data corruption observed, it's not a short transfer. The 
corruption is always the same: one byte missing at the beginning of a 128 B 
block. It happens only with slow Quantum LPS 240 drive, not with faster IBM 
DORS-32160.

> > The index register is not writable so we must(?) reset the PDMA engine
> > to recover. However, this quick attempt to fix does not work, maybe we
> > should reload the block count and continue?
>
> I don't know if it is possible to recover. If the last byte never reached
> the scsi bus, then once you reset the 53c400 core, you need the driver to
> perform a single-byte PIO transfer after the short PDMA transfer. This
> would require that you set the residual appropriately (though in my
> experience that may not be sufficient).
>
> It may be better to simply limit the transfer to 512 bytes instead of
> attempting to recover based on an undocumented (?) register, etc. Seems
> like a bit of a hack.
>
> > --- a/drivers/scsi/g_NCR5380.c
> > +++ b/drivers/scsi/g_NCR5380.c
> > @@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct
> > NCR5380_hostdata *hostdata,
> >                                 goto out_wait;
> >                         }
> >                 }
> > -
> > +               idx = NCR5380_read(13);
> > +               if (idx != 0) {
> > +                       printk("host idx=%d, start=%d\n", idx, start);
> > +                       NCR5380_write(hostdata->c400_ctl_status,
> > CSR_RESET); +                      
> > NCR5380_write(hostdata->c400_ctl_status, CSR_BASE); +                    
> >   goto out_wait;
> > +               }
> >                 if (hostdata->io_port && hostdata->io_width == 2)
> >                         outsw(hostdata->io_port +
> > hostdata->c400_host_buf, src + start, 64);
>
> I find it hard to reason about this code. For example, out_wait is to be
> removed. Let's get the preceding patches working and signed-off. Please go
> ahead and use a 512 B transfer for DTC436 testing if that will help get
> this patch series over the line.

OK, I agree. Let's fix the problems first and leave this hack for later.
Finn Thain June 28, 2017, 4:10 a.m. UTC | #5
On Tue, 27 Jun 2017, Ondrej Zary wrote:

> On Tuesday 27 June 2017 14:42:29 Finn Thain wrote:
> 
> > > ... it triggers sometimes: the value is 1 instead of 0. As we use 
> > > only 16-bit writes, I don't see how the value could ever be odd. 
> > > Looks like a bug in the chip. The index register corrupts during the 
> > > transfer, not after IRQ or timeout. The same check at beginning of 
> > > pwrite() did not trigger.
> >
> > Are you reading this register at the right moment? Have you tried 
> > waiting for it to reach zero, as in,
> >
> > 	if (NCR5380_poll_politely(hostdata, 13, 0xff, 0, HZ / 64) < 0)
> > 		/* printk, reset etc */;
> 
> I have not but will try (expecting that it will not change by itself).
> 

Now that I know that it is the byte at the beginning of the block that 
went missing, I agree that there's no point waiting for the byte count to 
change.

I've included a patch with your 512 B limit in v4.

Thanks.

> > Even if this is a reliable way to detect a short transfer, it would be 
> > nice to know the root cause. But I'm being unrealistic: the DTC436 
> > vendor never responded to my requests for technical documentation.
> 
> According to the data corruption observed, it's not a short transfer. 
> The corruption is always the same: one byte missing at the beginning of 
> a 128 B block. It happens only with slow Quantum LPS 240 drive, not with 
> faster IBM DORS-32160.
> 

--
diff mbox

Patch

--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -595,7 +603,13 @@  static inline int generic_NCR5380_pwrite(struct 
NCR5380_hostdata *hostdata,
                                goto out_wait;
                        }
                }
-
+               idx = NCR5380_read(13);
+               if (idx != 0) {
+                       printk("host idx=%d, start=%d\n", idx, start);
+                       NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
+                       NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
+                       goto out_wait;
+               }
                if (hostdata->io_port && hostdata->io_width == 2)
                        outsw(hostdata->io_port + hostdata->c400_host_buf,