mbox series

[00/15] s390: vfio-ccw dasd ipl support

Message ID 1544623878-11248-1-git-send-email-jjherne@linux.ibm.com (mailing list archive)
Headers show
Series s390: vfio-ccw dasd ipl support | expand

Message

Jason J. Herne Dec. 12, 2018, 2:11 p.m. UTC
This is to support booting from vfio-ccw dasd devices. We basically implement
the real hardware ipl procedure. This allows for booting Linux guests on
vfio-ccw devices.

vfio-ccw's channel program prefetch algorithm complicates ipl because most ipl
channel programs dynamically modify themselves. Details on the ipl process and
how we worked around this issue can be found in docs/devel/s390-dasd-ipl.txt.

NOTE: It has been a while, but I've finally chased down my infamous "reset bug".
On subsystem reset (I see this right after host ipl) we sometimes end up getting
an unexpected unit check status from a dasd device. This causes the first start
subchannel instruction to fail due to the pending unit check status. My solution
to this problem, as advised by the kernel folks, is to simply retry my ssch
instructions before declaring failure when unexpected unit checks happen. In the
event of a persistent error, after two retries we'll give up and print some
useful error info for the user.

Changelog
==========

rfc -> v1
---------
- Retry start subchannel operations on unexpected unit check status
- Print more complete state information on i/o error
- Fixed asm constraints (Q -> QS)
- Align lowcore to 8k boundary
- Initialize next_cpa in run_dynamic_ccw_program to avoid compiler warning
- Cleanup: Rework handling of i/o interrupts in assembler
- Fix netboot image (include cio.o in make file)
- Misc cleanups

Jason J. Herne (15):
  s390 vfio-ccw: Add bootindex property and IPLB data
  s390-bios: decouple cio setup from virtio
  s390-bios: decouple common boot logic from virtio
  s390-bios: Extend find_dev() for non-virtio devices
  s390-bios: Factor finding boot device out of virtio code path
  s390-bios: Clean up cio.h
  s390-bios: Decouple channel i/o logic from virtio
  s390-bios: Map low core memory
  s390-bios: ptr2u32 and u32toptr
  s390-bios: Support for running format-0/1 channel programs
  s390-bios: cio error handling
  s390-bios: Refactor virtio to run channel programs via cio
  s390-bios: Use control unit type to determine boot method
  s390-bios: Add channel command codes/structs needed for dasd-ipl
  s390-bios: Support booting from real dasd device

 docs/devel/s390-dasd-ipl.txt     | 132 ++++++++++++++
 hw/s390x/ipl.c                   |  15 ++
 hw/s390x/s390-ccw.c              |   9 +
 hw/vfio/ccw.c                    |  13 +-
 include/hw/s390x/s390-ccw.h      |   1 +
 include/hw/s390x/vfio-ccw.h      |  38 ++++
 pc-bios/s390-ccw/Makefile        |   2 +-
 pc-bios/s390-ccw/cio.c           | 374 +++++++++++++++++++++++++++++++++++++++
 pc-bios/s390-ccw/cio.h           | 236 +++++++++++++++++++-----
 pc-bios/s390-ccw/dasd-ipl.c      | 249 ++++++++++++++++++++++++++
 pc-bios/s390-ccw/dasd-ipl.h      |  16 ++
 pc-bios/s390-ccw/libc.h          |  23 +++
 pc-bios/s390-ccw/main.c          | 161 +++++++++++------
 pc-bios/s390-ccw/netboot.mak     |   2 +-
 pc-bios/s390-ccw/netmain.c       |   1 +
 pc-bios/s390-ccw/s390-arch.h     | 113 ++++++++++++
 pc-bios/s390-ccw/s390-ccw.h      |  10 +-
 pc-bios/s390-ccw/start.S         |  33 +++-
 pc-bios/s390-ccw/virtio-blkdev.c |   1 +
 pc-bios/s390-ccw/virtio.c        |  46 +----
 roms/SLOF                        |   2 +-
 tests/boot-serial-test.c         |   2 +-
 22 files changed, 1308 insertions(+), 171 deletions(-)
 create mode 100644 docs/devel/s390-dasd-ipl.txt
 create mode 100644 include/hw/s390x/vfio-ccw.h
 create mode 100644 pc-bios/s390-ccw/cio.c
 create mode 100644 pc-bios/s390-ccw/dasd-ipl.c
 create mode 100644 pc-bios/s390-ccw/dasd-ipl.h
 create mode 100644 pc-bios/s390-ccw/s390-arch.h

--
2.7.4

Comments

Cornelia Huck Dec. 12, 2018, 2:34 p.m. UTC | #1
On Wed, 12 Dec 2018 09:11:03 -0500
"Jason J. Herne" <jjherne@linux.ibm.com> wrote:

Hm, I think you need to adjust your cc: list. I added some more folks
(and removed Dong Jia, whose address is no longer valid AFAIK).

> This is to support booting from vfio-ccw dasd devices. We basically implement
> the real hardware ipl procedure. This allows for booting Linux guests on
> vfio-ccw devices.
> 
> vfio-ccw's channel program prefetch algorithm complicates ipl because most ipl
> channel programs dynamically modify themselves. Details on the ipl process and
> how we worked around this issue can be found in docs/devel/s390-dasd-ipl.txt.
> 
> NOTE: It has been a while, but I've finally chased down my infamous "reset bug".
> On subsystem reset (I see this right after host ipl) we sometimes end up getting
> an unexpected unit check status from a dasd device. This causes the first start
> subchannel instruction to fail due to the pending unit check status. My solution
> to this problem, as advised by the kernel folks, is to simply retry my ssch
> instructions before declaring failure when unexpected unit checks happen. In the
> event of a persistent error, after two retries we'll give up and print some
> useful error info for the user.

So, is that a status we only see because the vfio-ccw driver keeps the
subchannel enabled (as by the other recent thread)?

Is there any value in distinguishing different unit checks, or is retry
the best strategy in any case?

> 
> Changelog
> ==========
> 
> rfc -> v1
> ---------
> - Retry start subchannel operations on unexpected unit check status
> - Print more complete state information on i/o error
> - Fixed asm constraints (Q -> QS)
> - Align lowcore to 8k boundary
> - Initialize next_cpa in run_dynamic_ccw_program to avoid compiler warning
> - Cleanup: Rework handling of i/o interrupts in assembler
> - Fix netboot image (include cio.o in make file)
> - Misc cleanups
> 
> Jason J. Herne (15):
>   s390 vfio-ccw: Add bootindex property and IPLB data
>   s390-bios: decouple cio setup from virtio
>   s390-bios: decouple common boot logic from virtio
>   s390-bios: Extend find_dev() for non-virtio devices
>   s390-bios: Factor finding boot device out of virtio code path
>   s390-bios: Clean up cio.h
>   s390-bios: Decouple channel i/o logic from virtio
>   s390-bios: Map low core memory
>   s390-bios: ptr2u32 and u32toptr
>   s390-bios: Support for running format-0/1 channel programs
>   s390-bios: cio error handling
>   s390-bios: Refactor virtio to run channel programs via cio
>   s390-bios: Use control unit type to determine boot method
>   s390-bios: Add channel command codes/structs needed for dasd-ipl
>   s390-bios: Support booting from real dasd device
> 
>  docs/devel/s390-dasd-ipl.txt     | 132 ++++++++++++++
>  hw/s390x/ipl.c                   |  15 ++
>  hw/s390x/s390-ccw.c              |   9 +
>  hw/vfio/ccw.c                    |  13 +-
>  include/hw/s390x/s390-ccw.h      |   1 +
>  include/hw/s390x/vfio-ccw.h      |  38 ++++
>  pc-bios/s390-ccw/Makefile        |   2 +-
>  pc-bios/s390-ccw/cio.c           | 374 +++++++++++++++++++++++++++++++++++++++
>  pc-bios/s390-ccw/cio.h           | 236 +++++++++++++++++++-----
>  pc-bios/s390-ccw/dasd-ipl.c      | 249 ++++++++++++++++++++++++++
>  pc-bios/s390-ccw/dasd-ipl.h      |  16 ++
>  pc-bios/s390-ccw/libc.h          |  23 +++
>  pc-bios/s390-ccw/main.c          | 161 +++++++++++------
>  pc-bios/s390-ccw/netboot.mak     |   2 +-
>  pc-bios/s390-ccw/netmain.c       |   1 +
>  pc-bios/s390-ccw/s390-arch.h     | 113 ++++++++++++
>  pc-bios/s390-ccw/s390-ccw.h      |  10 +-
>  pc-bios/s390-ccw/start.S         |  33 +++-
>  pc-bios/s390-ccw/virtio-blkdev.c |   1 +
>  pc-bios/s390-ccw/virtio.c        |  46 +----
>  roms/SLOF                        |   2 +-
>  tests/boot-serial-test.c         |   2 +-
>  22 files changed, 1308 insertions(+), 171 deletions(-)
>  create mode 100644 docs/devel/s390-dasd-ipl.txt
>  create mode 100644 include/hw/s390x/vfio-ccw.h
>  create mode 100644 pc-bios/s390-ccw/cio.c
>  create mode 100644 pc-bios/s390-ccw/dasd-ipl.c
>  create mode 100644 pc-bios/s390-ccw/dasd-ipl.h
>  create mode 100644 pc-bios/s390-ccw/s390-arch.h
> 
> --
> 2.7.4
>
Jason J. Herne Dec. 12, 2018, 2:47 p.m. UTC | #2
On 12/12/18 9:34 AM, Cornelia Huck wrote:
> On Wed, 12 Dec 2018 09:11:03 -0500
> "Jason J. Herne" <jjherne@linux.ibm.com> wrote:
> 
> Hm, I think you need to adjust your cc: list. I added some more folks
> (and removed Dong Jia, whose address is no longer valid AFAIK).
> 

Correct. I forgot to update my list before I sent.


>> NOTE: It has been a while, but I've finally chased down my infamous "reset bug".
>> On subsystem reset (I see this right after host ipl) we sometimes end up getting
>> an unexpected unit check status from a dasd device. This causes the first start
>> subchannel instruction to fail due to the pending unit check status. My solution
>> to this problem, as advised by the kernel folks, is to simply retry my ssch
>> instructions before declaring failure when unexpected unit checks happen. In the
>> event of a persistent error, after two retries we'll give up and print some
>> useful error info for the user.
> 
> So, is that a status we only see because the vfio-ccw driver keeps the
> subchannel enabled (as by the other recent thread)?
> 
> Is there any value in distinguishing different unit checks, or is retry
> the best strategy in any case?
> 

It is status only, yes. I'm not sure if there is value in treating different unit checks 
differently. I discussed the problem with some of the kernel i/o devs and the suggestion I 
got was to simply retry. In the event of a real I/O error I doubt there is much we'd be 
able to do to recover so I think showing the user all of the relevant info (see patch 
s390-bios: cio error handling) and exiting is the right thing to do.
no-reply@patchew.org Dec. 12, 2018, 8:28 p.m. UTC | #3
Patchew URL: https://patchew.org/QEMU/1544623878-11248-1-git-send-email-jjherne@linux.ibm.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1544623878-11248-1-git-send-email-jjherne@linux.ibm.com
Subject: [Qemu-devel] [PATCH 00/15] s390: vfio-ccw dasd ipl support

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
7bc3233 s390-bios: Support booting from real dasd device
f2df85e s390-bios: Add channel command codes/structs needed for dasd-ipl
1fa1e47 s390-bios: Use control unit type to determine boot method
364ad22 s390-bios: Refactor virtio to run channel programs via cio
c3129f5 s390-bios: cio error handling
f33018f s390-bios: Support for running format-0/1 channel programs
3e79f11 s390-bios: ptr2u32 and u32toptr
00376ab s390-bios: Map low core memory
c1313ca s390-bios: Decouple channel i/o logic from virtio
98f3522 s390-bios: Clean up cio.h
3ff76c4 s390-bios: Factor finding boot device out of virtio code path
80e0855 s390-bios: Extend find_dev() for non-virtio devices
2127e05 s390-bios: decouple common boot logic from virtio
e47231c s390-bios: decouple cio setup from virtio
67265dd s390 vfio-ccw: Add bootindex property and IPLB data

=== OUTPUT BEGIN ===
Checking PATCH 1/15: s390 vfio-ccw: Add bootindex property and IPLB data...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#131: 
new file mode 100644

total: 0 errors, 1 warnings, 133 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 2/15: s390-bios: decouple cio setup from virtio...
Checking PATCH 3/15: s390-bios: decouple common boot logic from virtio...
ERROR: externs should be avoided in .c files
#28: FILE: pc-bios/s390-ccw/main.c:19:
+IplParameterBlock iplb __attribute__((__aligned__(PAGE_SIZE)));

total: 1 errors, 0 warnings, 65 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 4/15: s390-bios: Extend find_dev() for non-virtio devices...
Checking PATCH 5/15: s390-bios: Factor finding boot device out of virtio code path...
Checking PATCH 6/15: s390-bios: Clean up cio.h...
ERROR: spaces prohibited around that ':' (ctx:WxW)
#27: FILE: pc-bios/s390-ccw/cio.h:56:
+    __u32 isc    : 3;
                  ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#28: FILE: pc-bios/s390-ccw/cio.h:57:
+    __u32 ena    : 1;
                  ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#29: FILE: pc-bios/s390-ccw/cio.h:58:
+    __u32 mme    : 2;
                  ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#30: FILE: pc-bios/s390-ccw/cio.h:59:
+    __u32 mp     : 1;
                  ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#31: FILE: pc-bios/s390-ccw/cio.h:60:
+    __u32 csense : 1;
                  ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#32: FILE: pc-bios/s390-ccw/cio.h:61:
+    __u32 mbfc   : 1;
                  ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#82: FILE: pc-bios/s390-ccw/cio.h:111:
+    __u32 reserved5 : 4;
                     ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#83: FILE: pc-bios/s390-ccw/cio.h:112:
+    __u32 format2   : 4;
                     ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#84: FILE: pc-bios/s390-ccw/cio.h:113:
+    __u32 reserved6 : 24;
                     ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#130: FILE: pc-bios/s390-ccw/cio.h:167:
+    __u32 key  : 4;   /* flags, like key, suspend control, etc. */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#131: FILE: pc-bios/s390-ccw/cio.h:168:
+    __u32 spnd : 1;   /* suspend control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#132: FILE: pc-bios/s390-ccw/cio.h:169:
+    __u32 res1 : 1;   /* reserved */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#133: FILE: pc-bios/s390-ccw/cio.h:170:
+    __u32 mod  : 1;   /* modification control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#134: FILE: pc-bios/s390-ccw/cio.h:171:
+    __u32 sync : 1;   /* synchronize control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#135: FILE: pc-bios/s390-ccw/cio.h:172:
+    __u32 fmt  : 1;   /* format control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#136: FILE: pc-bios/s390-ccw/cio.h:173:
+    __u32 pfch : 1;   /* prefetch control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#137: FILE: pc-bios/s390-ccw/cio.h:174:
+    __u32 isic : 1;   /* initial-status-interruption control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#138: FILE: pc-bios/s390-ccw/cio.h:175:
+    __u32 alcc : 1;   /* address-limit-checking control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#139: FILE: pc-bios/s390-ccw/cio.h:176:
+    __u32 ssic : 1;   /* suppress-suspended-interr. control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#140: FILE: pc-bios/s390-ccw/cio.h:177:
+    __u32 res2 : 1;   /* reserved */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#141: FILE: pc-bios/s390-ccw/cio.h:178:
+    __u32 c64  : 1;   /* IDAW/QDIO 64 bit control  */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#142: FILE: pc-bios/s390-ccw/cio.h:179:
+    __u32 i2k  : 1;   /* IDAW 2/4kB block size control */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#143: FILE: pc-bios/s390-ccw/cio.h:180:
+    __u32 lpm  : 8;   /* logical path mask */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#144: FILE: pc-bios/s390-ccw/cio.h:181:
+    __u32 ils  : 1;   /* incorrect length */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#145: FILE: pc-bios/s390-ccw/cio.h:182:
+    __u32 zero : 6;   /* reserved zeros */
                ^

ERROR: spaces prohibited around that ':' (ctx:WxW)
#146: FILE: pc-bios/s390-ccw/cio.h:183:
+    __u32 orbx : 1;   /* ORB extension control */
                ^

total: 26 errors, 0 warnings, 171 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 7/15: s390-bios: Decouple channel i/o logic from virtio...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#27: 
new file mode 100644

total: 0 errors, 1 warnings, 127 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 8/15: s390-bios: Map low core memory...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#34: 
new file mode 100644

total: 0 errors, 1 warnings, 114 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 9/15: s390-bios: ptr2u32 and u32toptr...
Checking PATCH 10/15: s390-bios: Support for running format-0/1 channel programs...
Checking PATCH 11/15: s390-bios: cio error handling...
Checking PATCH 12/15: s390-bios: Refactor virtio to run channel programs via cio...
Checking PATCH 13/15: s390-bios: Use control unit type to determine boot method...
Checking PATCH 14/15: s390-bios: Add channel command codes/structs needed for dasd-ipl...
Checking PATCH 15/15: s390-bios: Support booting from real dasd device...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#12: 
new file mode 100644

total: 0 errors, 1 warnings, 438 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/1544623878-11248-1-git-send-email-jjherne@linux.ibm.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Jason J. Herne Jan. 8, 2019, 4:37 p.m. UTC | #4
On 12/12/18 9:34 AM, Cornelia Huck wrote:
...
>>
>> NOTE: It has been a while, but I've finally chased down my infamous "reset bug".
>> On subsystem reset (I see this right after host ipl) we sometimes end up getting
>> an unexpected unit check status from a dasd device. This causes the first start
>> subchannel instruction to fail due to the pending unit check status. My solution
>> to this problem, as advised by the kernel folks, is to simply retry my ssch
>> instructions before declaring failure when unexpected unit checks happen. In the
>> event of a persistent error, after two retries we'll give up and print some
>> useful error info for the user.
> 
> So, is that a status we only see because the vfio-ccw driver keeps the
> subchannel enabled (as by the other recent thread)?
> 
> Is there any value in distinguishing different unit checks, or is retry
> the best strategy in any case?
> 
The status presents on device reset. So when the host kernel IPLs this status will be 
present. The very first attempt to use the device (SSCH, other instructions perhaps?) will 
cause this status to be presented. Sometimes the host kernel must "get there first" and 
clear the status. And other times the guest (by way of Qemu bios) gets there first.

The kernel handles unexpected unit checks by simply retrying a low number of times before 
giving up. Given that bios code is a constant frequency code path, and the kernel has 
already set this precedent, I feel safe with this decision and don't see a ton of value in 
doing much more. If we find a case that requires more handling we can take a look at it.
Halil Pasic Jan. 8, 2019, 5:36 p.m. UTC | #5
On Tue, 8 Jan 2019 11:37:56 -0500
"Jason J. Herne" <jjherne@linux.ibm.com> wrote:

> On 12/12/18 9:34 AM, Cornelia Huck wrote:
> ...
> >>
> >> NOTE: It has been a while, but I've finally chased down my infamous "reset bug".
> >> On subsystem reset (I see this right after host ipl) we sometimes end up getting
> >> an unexpected unit check status from a dasd device. This causes the first start
> >> subchannel instruction to fail due to the pending unit check status. My solution
> >> to this problem, as advised by the kernel folks, is to simply retry my ssch
> >> instructions before declaring failure when unexpected unit checks happen. In the
> >> event of a persistent error, after two retries we'll give up and print some
> >> useful error info for the user.
> > 
> > So, is that a status we only see because the vfio-ccw driver keeps the
> > subchannel enabled (as by the other recent thread)?
> > 
> > Is there any value in distinguishing different unit checks, or is retry
> > the best strategy in any case?
> > 
> The status presents on device reset. So when the host kernel IPLs this status will be 
> present. The very first attempt to use the device (SSCH, other instructions perhaps?) will 
> cause this status to be presented. Sometimes the host kernel must "get there first" and 
> clear the status. And other times the guest (by way of Qemu bios) gets there first.
> 
> The kernel handles unexpected unit checks by simply retrying a low number of times before 
> giving up. Given that bios code is a constant frequency code path, and the kernel has 
> already set this precedent, I feel safe with this decision and don't see a ton of value in 
> doing much more. If we find a case that requires more handling we can take a look at it.
> 

I agree, doing elaborate CIO error handling here does not seem like a
particularly good idea.

Something remotely related -- let me play crazy for a moment: let's say
we pass-through two DASD's to a single guest, one as the IPL disk and
one just so. If I'm not mistaken, the guest is guaranteed to get this
special after reset unit check (let's say freshly constructed VM),
unless there is another OS messing with the same DASD maybe, at least
for the 'just son DASD'. I would even guess that the condition in
question is indicated even for the IPL-DASD (if we thing guest1).

But ccw-passthrough won't get perfect anyway. So I think we can ignore
this side effect of the reset, unless  a need arises not to.

Regards,
Halil
Cornelia Huck Jan. 9, 2019, 9:57 a.m. UTC | #6
On Tue, 8 Jan 2019 18:36:09 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Tue, 8 Jan 2019 11:37:56 -0500
> "Jason J. Herne" <jjherne@linux.ibm.com> wrote:
> 
> > On 12/12/18 9:34 AM, Cornelia Huck wrote:
> > ...  
> > >>
> > >> NOTE: It has been a while, but I've finally chased down my infamous "reset bug".
> > >> On subsystem reset (I see this right after host ipl) we sometimes end up getting
> > >> an unexpected unit check status from a dasd device. This causes the first start
> > >> subchannel instruction to fail due to the pending unit check status. My solution
> > >> to this problem, as advised by the kernel folks, is to simply retry my ssch
> > >> instructions before declaring failure when unexpected unit checks happen. In the
> > >> event of a persistent error, after two retries we'll give up and print some
> > >> useful error info for the user.  
> > > 
> > > So, is that a status we only see because the vfio-ccw driver keeps the
> > > subchannel enabled (as by the other recent thread)?
> > > 
> > > Is there any value in distinguishing different unit checks, or is retry
> > > the best strategy in any case?
> > >   
> > The status presents on device reset. So when the host kernel IPLs this status will be 
> > present. The very first attempt to use the device (SSCH, other instructions perhaps?) will 
> > cause this status to be presented. Sometimes the host kernel must "get there first" and 
> > clear the status. And other times the guest (by way of Qemu bios) gets there first.
> > 
> > The kernel handles unexpected unit checks by simply retrying a low number of times before 
> > giving up. Given that bios code is a constant frequency code path, and the kernel has 
> > already set this precedent, I feel safe with this decision and don't see a ton of value in 
> > doing much more. If we find a case that requires more handling we can take a look at it.

Yeah, my thinking was "should we check for this particular unit check
so we don't ignore other problems"? But if the kernel simply retries,
let's just do that in the bios as well.

> I agree, doing elaborate CIO error handling here does not seem like a
> particularly good idea.
> 
> Something remotely related -- let me play crazy for a moment: let's say
> we pass-through two DASD's to a single guest, one as the IPL disk and
> one just so. If I'm not mistaken, the guest is guaranteed to get this
> special after reset unit check (let's say freshly constructed VM),
> unless there is another OS messing with the same DASD maybe, at least
> for the 'just son DASD'. I would even guess that the condition in
> question is indicated even for the IPL-DASD (if we thing guest1).

My thinking is that the guest needs to be able to deal with this unit
check if it gets it, and chances are good that it already does the
right thing if the OS has been running on non-KVM as well. We probably
can neglect any differences between IPL and non-IPL, as the unit check
is simply something that the guest *might* see.

> 
> But ccw-passthrough won't get perfect anyway. So I think we can ignore
> this side effect of the reset, unless  a need arises not to.
> 
> Regards,
> Halil
> 
>