Message ID | 20200708185135.46694-3-david@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | s390x: initial support for virtio-mem |
On Wed, 8 Jul 2020 20:51:32 +0200 David Hildenbrand <david@redhat.com> wrote: > Let's implement the "storage configuration" part of diag260. This diag > is found under z/VM, to indicate usable chunks of memory to the guest OS. > As I don't have access to documentation, I have no clue what the actual > error cases are, and which other stuff we could eventually query using this > interface. Somebody with access to documentation should fix this. This > implementation seems to work with Linux guests just fine. > > The Linux kernel supports diag260 to query the available memory since > v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM > (with maxmem being defined and bigger than the memory size, e.g., "-m > 2G,maxmem=4G"), just as if support for SCLP storage information is not > implemented. They will fail to detect the actual initial memory size. > > This interface allows us to expose the maximum ramsize via sclp > and the initial ramsize via diag260 - without having to mess with the > memory increment size and having to align the initial memory size to it. > > This is a preparation for memory device support. We'll unlock the > implementation with a new QEMU machine that supports memory devices. 
> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > target/s390x/diag.c | 57 ++++++++++++++++++++++++++++++++++++++ > target/s390x/internal.h | 2 ++ > target/s390x/kvm.c | 11 ++++++++ > target/s390x/misc_helper.c | 6 ++++ > target/s390x/translate.c | 4 +++ > 5 files changed, 80 insertions(+) > > diff --git a/target/s390x/diag.c b/target/s390x/diag.c > index 1a48429564..c3b1e24b2c 100644 > --- a/target/s390x/diag.c > +++ b/target/s390x/diag.c > @@ -23,6 +23,63 @@ > #include "hw/s390x/pv.h" > #include "kvm_s390x.h" > > +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra) > +{ > + MachineState *ms = MACHINE(qdev_get_machine()); > + const ram_addr_t initial_ram_size = ms->ram_size; > + const uint64_t subcode = env->regs[r3]; > + S390CPU *cpu = env_archcpu(env); > + ram_addr_t addr, length; > + uint64_t tmp; > + > + /* TODO: Unlock with new QEMU machine. */ > + if (false) { > + s390_program_interrupt(env, PGM_OPERATION, ra); > + return; > + } > + > + /* > + * There also seems to be subcode "0xc", which stores the size of the > + * first chunk and the total size to r1/r2. It's only used by very old > + * Linux, so don't implement it. FWIW, https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf seems to list the available subcodes. Anything but 0xc and 0x10 is for 24/31 bit only, so we can safely ignore them. Not sure what we want to do with 0xc: it is supposed to "Return the highest addressable byte of virtual storage in the host-primary address space, including named saved systems and saved segments", so returning the end of the address space should be easy enough, but not very useful. > + */ > + if ((r1 & 1) || subcode != 0x10) { > + s390_program_interrupt(env, PGM_SPECIFICATION, ra); > + return; > + } > + addr = env->regs[r1]; > + length = env->regs[r1 + 1]; > + > + /* FIXME: Somebody with documentation should fix this. 
*/ Doc mentioned above says for specification exception: "For subcode X'10': • Rx is not an even-numbered register. • The address contained in Rx is not on a quadword boundary. • The length contained in Rx+1 is not a positive multiple of 16." > + if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) { > + s390_program_interrupt(env, PGM_SPECIFICATION, ra); > + return; > + } > + > + /* FIXME: Somebody with documentation should fix this. */ > + if (!length) { Probably specification exception as well? > + setcc(cpu, 3); > + return; > + } > + > + /* FIXME: Somebody with documentation should fix this. */ For access exception: "For subcode X'10', an error occurred trying to store the extent information into the guest's output area." > + if (!address_space_access_valid(&address_space_memory, addr, length, true, > + MEMTXATTRS_UNSPECIFIED)) { > + s390_program_interrupt(env, PGM_ADDRESSING, ra); > + return; > + } > + > + /* Indicate our initial memory ([0 .. ram_size - 1]) */ > + tmp = cpu_to_be64(0); > + cpu_physical_memory_write(addr, &tmp, sizeof(tmp)); > + tmp = cpu_to_be64(initial_ram_size - 1); > + cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp)); > + > + /* Exactly one entry was stored. */ > + env->regs[r3] = 1; > + setcc(cpu, 0); > +} > + > int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3) > { > uint64_t func = env->regs[r1]; (...) 
> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c > index 58dbc023eb..d7274eb320 100644 > --- a/target/s390x/misc_helper.c > +++ b/target/s390x/misc_helper.c > @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) > uint64_t r; > > switch (num) { > + case 0x260: > + qemu_mutex_lock_iothread(); > + handle_diag_260(env, r1, r3, GETPC()); > + qemu_mutex_unlock_iothread(); > + r = 0; > + break; > case 0x500: > /* KVM hypercall */ > qemu_mutex_lock_iothread(); Looking at the doc referenced above, it seems that we treat every diag call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated to your patch; maybe I'm misreading.) > diff --git a/target/s390x/translate.c b/target/s390x/translate.c > index 4f6f1e31cd..6bb8b6e513 100644 > --- a/target/s390x/translate.c > +++ b/target/s390x/translate.c > @@ -2398,6 +2398,10 @@ static DisasJumpType op_diag(DisasContext *s, DisasOps *o) > TCGv_i32 func_code = tcg_const_i32(get_field(s, i2)); > > gen_helper_diag(cpu_env, r1, r3, func_code); > + /* Only some diags modify the CC. */ > + if (get_field(s, i2) == 0x260) { > + set_cc_static(s); > + } > > tcg_temp_free_i32(func_code); > tcg_temp_free_i32(r3);
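The subcode 0x10 flow being reviewed above can be sketched standalone (simplified and illustrative: the function names, the flat guest-memory buffer, and the error return values are assumptions for this sketch, not QEMU's API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative sketch of diag 0x260 subcode 0x10 ("storage configuration"):
 * validate the register/alignment rules quoted from the z/VM doc, then
 * store one (start, end) extent pair as big-endian 64-bit values into the
 * guest-supplied output area.  Not QEMU code; guest memory is a flat buffer.
 */

/* stand-in for QEMU's cpu_to_be64() */
static uint64_t to_be64(uint64_t v)
{
    uint8_t b[8];
    for (int i = 0; i < 8; i++) {
        b[i] = (uint8_t)(v >> (56 - 8 * i));
    }
    uint64_t r;
    memcpy(&r, b, sizeof(r));
    return r;
}

/*
 * Returns the number of extents stored; -1 models a specification
 * exception, -2 an addressing exception.
 */
static int diag260_subcode_0x10(unsigned r1, uint64_t addr, uint64_t length,
                                uint64_t ram_size,
                                uint8_t *guest_mem, uint64_t guest_mem_size)
{
    /*
     * Per the doc: Rx must be even, the address quadword-aligned, and
     * the length a positive multiple of 16.
     */
    if ((r1 & 1) || (addr & 0xf) || (length & 0xf) || !length) {
        return -1;
    }
    if (addr + length > guest_mem_size) {
        return -2;
    }
    uint64_t start = to_be64(0);
    uint64_t end = to_be64(ram_size - 1);   /* [0 .. ram_size - 1] */
    memcpy(guest_mem + addr, &start, sizeof(start));
    memcpy(guest_mem + addr + sizeof(start), &end, sizeof(end));
    return 1;   /* exactly one extent stored */
}
```

The sketch mirrors the patch's single-extent behaviour: one (0, ram_size - 1) pair, r3 set to 1, cc 0.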
On 08.07.20 20:51, David Hildenbrand wrote: > Let's implement the "storage configuration" part of diag260. This diag > is found under z/VM, to indicate usable chunks of memory to the guest OS. > As I don't have access to documentation, I have no clue what the actual > error cases are, and which other stuff we could eventually query using this > interface. Somebody with access to documentation should fix this. This > implementation seems to work with Linux guests just fine. > > The Linux kernel supports diag260 to query the available memory since > v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM > (with maxmem being defined and bigger than the memory size, e.g., "-m > 2G,maxmem=4G"), just as if support for SCLP storage information is not > implemented. They will fail to detect the actual initial memory size. > > This interface allows us to expose the maximum ramsize via sclp > and the initial ramsize via diag260 - without having to mess with the > memory increment size and having to align the initial memory size to it. > > This is a preparation for memory device support. We'll unlock the > implementation with a new QEMU machine that supports memory devices. > > Signed-off-by: David Hildenbrand <david@redhat.com> I have not looked into this, so this is purely a question. Is there a way to hotplug virtio-mem memory beyond the initial size of the memory as specified by the initial sclp? Then we could avoid doing this platform-specific diag260. The only issue I see is when we need to go beyond 4TB due to the page table upgrade in the kernel. FWIW diag 260 is publicly documented.
On 09.07.20 12:37, Cornelia Huck wrote: > On Wed, 8 Jul 2020 20:51:32 +0200 > David Hildenbrand <david@redhat.com> wrote: > >> Let's implement the "storage configuration" part of diag260. This diag >> is found under z/VM, to indicate usable chunks of memory tot he guest OS. >> As I don't have access to documentation, I have no clue what the actual >> error cases are, and which other stuff we could eventually query using this >> interface. Somebody with access to documentation should fix this. This >> implementation seems to work with Linux guests just fine. >> >> The Linux kernel supports diag260 to query the available memory since >> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM >> (with maxmem being defined and bigger than the memory size, e.g., "-m >> 2G,maxmem=4G"), just as if support for SCLP storage information is not >> implemented. They will fail to detect the actual initial memory size. >> >> This interface allows us to expose the maximum ramsize via sclp >> and the initial ramsize via diag260 - without having to mess with the >> memory increment size and having to align the initial memory size to it. >> >> This is a preparation for memory device support. We'll unlock the >> implementation with a new QEMU machine that supports memory devices. 
>> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> target/s390x/diag.c | 57 ++++++++++++++++++++++++++++++++++++++ >> target/s390x/internal.h | 2 ++ >> target/s390x/kvm.c | 11 ++++++++ >> target/s390x/misc_helper.c | 6 ++++ >> target/s390x/translate.c | 4 +++ >> 5 files changed, 80 insertions(+) >> >> diff --git a/target/s390x/diag.c b/target/s390x/diag.c >> index 1a48429564..c3b1e24b2c 100644 >> --- a/target/s390x/diag.c >> +++ b/target/s390x/diag.c >> @@ -23,6 +23,63 @@ >> #include "hw/s390x/pv.h" >> #include "kvm_s390x.h" >> >> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra) >> +{ >> + MachineState *ms = MACHINE(qdev_get_machine()); >> + const ram_addr_t initial_ram_size = ms->ram_size; >> + const uint64_t subcode = env->regs[r3]; >> + S390CPU *cpu = env_archcpu(env); >> + ram_addr_t addr, length; >> + uint64_t tmp; >> + >> + /* TODO: Unlock with new QEMU machine. */ >> + if (false) { >> + s390_program_interrupt(env, PGM_OPERATION, ra); >> + return; >> + } >> + >> + /* >> + * There also seems to be subcode "0xc", which stores the size of the >> + * first chunk and the total size to r1/r2. It's only used by very old >> + * Linux, so don't implement it. > > FWIW, > https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf > seems to list the available subcodes. Anything but 0xc and 0x10 is for > 24/31 bit only, so we can safely ignore them. Not sure what we want to > do with 0xc: it is supposed to "Return the highest addressable byte of > virtual storage in the host-primary address space, including named > saved systems and saved segments", so returning the end of the address > space should be easy enough, but not very useful. Thanks for the link to the documentation! Either my google search skills are bad or that stuff is just hard to find :) I'll have a look and see how to make sense of 0xc. Smells like "maxram_size - 1" indeed. 
> >> + */ >> + if ((r1 & 1) || subcode != 0x10) { >> + s390_program_interrupt(env, PGM_SPECIFICATION, ra); >> + return; >> + } >> + addr = env->regs[r1]; >> + length = env->regs[r1 + 1]; >> + >> + /* FIXME: Somebody with documentation should fix this. */ > > Doc mentioned above says for specification exception: > > "For subcode X'10': > • Rx is not an even-numbered register. > • The address contained in Rx is not on a quadword boundary. > • The length contained in Rx+1 is not a positive multiple of 16." > >> + if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) { >> + s390_program_interrupt(env, PGM_SPECIFICATION, ra); >> + return; >> + } >> + >> + /* FIXME: Somebody with documentation should fix this. */ >> + if (!length) { > > Probably specification exception as well? Yeah I'll add "|| !length" above. > >> + setcc(cpu, 3); >> + return; >> + } >> + >> + /* FIXME: Somebody with documentation should fix this. */ > > For access exception: > > "For subcode X'10', an error occurred trying to store the extent > information into the guest's output area." > Okay, looks good then! >> + if (!address_space_access_valid(&address_space_memory, addr, length, true, >> + MEMTXATTRS_UNSPECIFIED)) { >> + s390_program_interrupt(env, PGM_ADDRESSING, ra); >> + return; >> + } >> + >> + /* Indicate our initial memory ([0 .. ram_size - 1]) */ >> + tmp = cpu_to_be64(0); >> + cpu_physical_memory_write(addr, &tmp, sizeof(tmp)); >> + tmp = cpu_to_be64(initial_ram_size - 1); >> + cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp)); >> + >> + /* Exactly one entry was stored. */ >> + env->regs[r3] = 1; >> + setcc(cpu, 0); >> +} >> + >> int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3) >> { >> uint64_t func = env->regs[r1]; > > (...) 
> >> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c >> index 58dbc023eb..d7274eb320 100644 >> --- a/target/s390x/misc_helper.c >> +++ b/target/s390x/misc_helper.c >> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) >> uint64_t r; >> >> switch (num) { >> + case 0x260: >> + qemu_mutex_lock_iothread(); >> + handle_diag_260(env, r1, r3, GETPC()); >> + qemu_mutex_unlock_iothread(); >> + r = 0; >> + break; >> case 0x500: >> /* KVM hypercall */ >> qemu_mutex_lock_iothread(); > > Looking at the doc referenced above, it seems that we treat every diag > call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated > to your patch; maybe I'm misreading.) Interesting. Adding in onto my todo list. Thanks again!
On 09.07.20 12:52, Christian Borntraeger wrote: > > On 08.07.20 20:51, David Hildenbrand wrote: >> Let's implement the "storage configuration" part of diag260. This diag >> is found under z/VM, to indicate usable chunks of memory tot he guest OS. >> As I don't have access to documentation, I have no clue what the actual >> error cases are, and which other stuff we could eventually query using this >> interface. Somebody with access to documentation should fix this. This >> implementation seems to work with Linux guests just fine. >> >> The Linux kernel supports diag260 to query the available memory since >> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM >> (with maxmem being defined and bigger than the memory size, e.g., "-m >> 2G,maxmem=4G"), just as if support for SCLP storage information is not >> implemented. They will fail to detect the actual initial memory size. >> >> This interface allows us to expose the maximum ramsize via sclp >> and the initial ramsize via diag260 - without having to mess with the >> memory increment size and having to align the initial memory size to it. >> >> This is a preparation for memory device support. We'll unlock the >> implementation with a new QEMU machine that supports memory devices. >> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > I have not looked into this, so this is purely a question. > > Is there a way to hotplug virtio-mem memory beyond the initial size of > the memory as specified by the initial sclp)? then we could avoid doing > this platform specfic diag260? We need a way to tell the guest about the maximum possible PFN, so it can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT tables. On s390x, the only way I see is using a combination of diag260, without introducing any other new mechanisms. Currently Linux selects 3. vs 4 level page tables based on that size (I think that's what you were referring to with the 4TB limit). 
I can see that kasan also does some magic based on the value ("populate kasan shadow for untracked memory"), but did not look into the details. I *think* kasan will never be able to track that memory, but am not completely sure. I'd like to avoid something like what you propose (that's why I searched and discovered diag260 after all :) ), especially to not silently break in the future, when other assumptions based on that value are introduced. E.g., on my z/VM LinuxONE Community Cloud machine, diag260 gets used as default, so it does not seem to be a corner case mechanism nowadays. > the only issue I see is when we need to go beyond 4TB due to the page table > upgrade in the kernel. > > FWIW diag 260 is publicly documented. Yeah, Conny pointed me at the doc - makes things easier :)
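The 3-vs-4-level point above can be made concrete; roughly (the 4 TiB boundary and the helper name are illustrative for this sketch — the real selection lives in the s390 Linux boot code, not here):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: s390 Linux picks its page-table depth from the
 * maximum possible address it may have to map.  Three levels cover up
 * to 4 TiB; anything beyond needs a fourth level (the "page table
 * upgrade" mentioned in this thread).  This is why the guest must know
 * the maximum possible PFN up front, not just the initial RAM size.
 */
static int page_table_levels(uint64_t max_possible_addr)
{
    return max_possible_addr < (4ULL << 40) ? 3 : 4;
}
```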
On 09.07.20 12:37, Cornelia Huck wrote: > On Wed, 8 Jul 2020 20:51:32 +0200 > David Hildenbrand <david@redhat.com> wrote: > >> Let's implement the "storage configuration" part of diag260. This diag >> is found under z/VM, to indicate usable chunks of memory tot he guest OS. >> As I don't have access to documentation, I have no clue what the actual >> error cases are, and which other stuff we could eventually query using this >> interface. Somebody with access to documentation should fix this. This >> implementation seems to work with Linux guests just fine. >> >> The Linux kernel supports diag260 to query the available memory since >> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM >> (with maxmem being defined and bigger than the memory size, e.g., "-m >> 2G,maxmem=4G"), just as if support for SCLP storage information is not >> implemented. They will fail to detect the actual initial memory size. >> >> This interface allows us to expose the maximum ramsize via sclp >> and the initial ramsize via diag260 - without having to mess with the >> memory increment size and having to align the initial memory size to it. >> >> This is a preparation for memory device support. We'll unlock the >> implementation with a new QEMU machine that supports memory devices. 
>> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> target/s390x/diag.c | 57 ++++++++++++++++++++++++++++++++++++++ >> target/s390x/internal.h | 2 ++ >> target/s390x/kvm.c | 11 ++++++++ >> target/s390x/misc_helper.c | 6 ++++ >> target/s390x/translate.c | 4 +++ >> 5 files changed, 80 insertions(+) >> >> diff --git a/target/s390x/diag.c b/target/s390x/diag.c >> index 1a48429564..c3b1e24b2c 100644 >> --- a/target/s390x/diag.c >> +++ b/target/s390x/diag.c >> @@ -23,6 +23,63 @@ >> #include "hw/s390x/pv.h" >> #include "kvm_s390x.h" >> >> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra) >> +{ >> + MachineState *ms = MACHINE(qdev_get_machine()); >> + const ram_addr_t initial_ram_size = ms->ram_size; >> + const uint64_t subcode = env->regs[r3]; >> + S390CPU *cpu = env_archcpu(env); >> + ram_addr_t addr, length; >> + uint64_t tmp; >> + >> + /* TODO: Unlock with new QEMU machine. */ >> + if (false) { >> + s390_program_interrupt(env, PGM_OPERATION, ra); >> + return; >> + } >> + >> + /* >> + * There also seems to be subcode "0xc", which stores the size of the >> + * first chunk and the total size to r1/r2. It's only used by very old >> + * Linux, so don't implement it. > > FWIW, > https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf > seems to list the available subcodes. Anything but 0xc and 0x10 is for > 24/31 bit only, so we can safely ignore them. Not sure what we want to > do with 0xc: it is supposed to "Return the highest addressable byte of > virtual storage in the host-primary address space, including named > saved systems and saved segments", so returning the end of the address > space should be easy enough, but not very useful. 
> >> + */ >> + if ((r1 & 1) || subcode != 0x10) { >> + s390_program_interrupt(env, PGM_SPECIFICATION, ra); >> + return; >> + } >> + addr = env->regs[r1]; >> + length = env->regs[r1 + 1]; >> + >> + /* FIXME: Somebody with documentation should fix this. */ > > Doc mentioned above says for specification exception: > > "For subcode X'10': > • Rx is not an even-numbered register. > • The address contained in Rx is not on a quadword boundary. > • The length contained in Rx+1 is not a positive multiple of 16." > >> + if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) { >> + s390_program_interrupt(env, PGM_SPECIFICATION, ra); >> + return; >> + } >> + >> + /* FIXME: Somebody with documentation should fix this. */ >> + if (!length) { > > Probably specification exception as well? > >> + setcc(cpu, 3); >> + return; >> + } >> + >> + /* FIXME: Somebody with documentation should fix this. */ > > For access exception: > > "For subcode X'10', an error occurred trying to store the extent > information into the guest's output area." > >> + if (!address_space_access_valid(&address_space_memory, addr, length, true, >> + MEMTXATTRS_UNSPECIFIED)) { >> + s390_program_interrupt(env, PGM_ADDRESSING, ra); >> + return; >> + } >> + >> + /* Indicate our initial memory ([0 .. ram_size - 1]) */ >> + tmp = cpu_to_be64(0); >> + cpu_physical_memory_write(addr, &tmp, sizeof(tmp)); >> + tmp = cpu_to_be64(initial_ram_size - 1); >> + cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp)); >> + >> + /* Exactly one entry was stored. */ >> + env->regs[r3] = 1; >> + setcc(cpu, 0); >> +} >> + >> int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3) >> { >> uint64_t func = env->regs[r1]; > > (...) 
> >> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c >> index 58dbc023eb..d7274eb320 100644 >> --- a/target/s390x/misc_helper.c >> +++ b/target/s390x/misc_helper.c >> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) >> uint64_t r; >> >> switch (num) { >> + case 0x260: >> + qemu_mutex_lock_iothread(); >> + handle_diag_260(env, r1, r3, GETPC()); >> + qemu_mutex_unlock_iothread(); >> + r = 0; >> + break; >> case 0x500: >> /* KVM hypercall */ >> qemu_mutex_lock_iothread(); > > Looking at the doc referenced above, it seems that we treat every diag > call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated > to your patch; maybe I'm misreading.)

That's also a BUG in kvm then?

int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
{
	...
	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
	...
}
On 10.07.20 10:32, David Hildenbrand wrote: > On 09.07.20 12:37, Cornelia Huck wrote: >> On Wed, 8 Jul 2020 20:51:32 +0200 >> David Hildenbrand <david@redhat.com> wrote: >> >>> Let's implement the "storage configuration" part of diag260. This diag >>> is found under z/VM, to indicate usable chunks of memory tot he guest OS. >>> As I don't have access to documentation, I have no clue what the actual >>> error cases are, and which other stuff we could eventually query using this >>> interface. Somebody with access to documentation should fix this. This >>> implementation seems to work with Linux guests just fine. >>> >>> The Linux kernel supports diag260 to query the available memory since >>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM >>> (with maxmem being defined and bigger than the memory size, e.g., "-m >>> 2G,maxmem=4G"), just as if support for SCLP storage information is not >>> implemented. They will fail to detect the actual initial memory size. >>> >>> This interface allows us to expose the maximum ramsize via sclp >>> and the initial ramsize via diag260 - without having to mess with the >>> memory increment size and having to align the initial memory size to it. >>> >>> This is a preparation for memory device support. We'll unlock the >>> implementation with a new QEMU machine that supports memory devices. 
>>> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> --- >>> target/s390x/diag.c | 57 ++++++++++++++++++++++++++++++++++++++ >>> target/s390x/internal.h | 2 ++ >>> target/s390x/kvm.c | 11 ++++++++ >>> target/s390x/misc_helper.c | 6 ++++ >>> target/s390x/translate.c | 4 +++ >>> 5 files changed, 80 insertions(+) >>> >>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c >>> index 1a48429564..c3b1e24b2c 100644 >>> --- a/target/s390x/diag.c >>> +++ b/target/s390x/diag.c >>> @@ -23,6 +23,63 @@ >>> #include "hw/s390x/pv.h" >>> #include "kvm_s390x.h" >>> >>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra) >>> +{ >>> + MachineState *ms = MACHINE(qdev_get_machine()); >>> + const ram_addr_t initial_ram_size = ms->ram_size; >>> + const uint64_t subcode = env->regs[r3]; >>> + S390CPU *cpu = env_archcpu(env); >>> + ram_addr_t addr, length; >>> + uint64_t tmp; >>> + >>> + /* TODO: Unlock with new QEMU machine. */ >>> + if (false) { >>> + s390_program_interrupt(env, PGM_OPERATION, ra); >>> + return; >>> + } >>> + >>> + /* >>> + * There also seems to be subcode "0xc", which stores the size of the >>> + * first chunk and the total size to r1/r2. It's only used by very old >>> + * Linux, so don't implement it. >> >> FWIW, >> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf >> seems to list the available subcodes. Anything but 0xc and 0x10 is for >> 24/31 bit only, so we can safely ignore them. Not sure what we want to >> do with 0xc: it is supposed to "Return the highest addressable byte of >> virtual storage in the host-primary address space, including named >> saved systems and saved segments", so returning the end of the address >> space should be easy enough, but not very useful. 
>> >>> + */ >>> + if ((r1 & 1) || subcode != 0x10) { >>> + s390_program_interrupt(env, PGM_SPECIFICATION, ra); >>> + return; >>> + } >>> + addr = env->regs[r1]; >>> + length = env->regs[r1 + 1]; >>> + >>> + /* FIXME: Somebody with documentation should fix this. */ >> >> Doc mentioned above says for specification exception: >> >> "For subcode X'10': >> • Rx is not an even-numbered register. >> • The address contained in Rx is not on a quadword boundary. >> • The length contained in Rx+1 is not a positive multiple of 16." >> >>> + if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) { >>> + s390_program_interrupt(env, PGM_SPECIFICATION, ra); >>> + return; >>> + } >>> + >>> + /* FIXME: Somebody with documentation should fix this. */ >>> + if (!length) { >> >> Probably specification exception as well? >> >>> + setcc(cpu, 3); >>> + return; >>> + } >>> + >>> + /* FIXME: Somebody with documentation should fix this. */ >> >> For access exception: >> >> "For subcode X'10', an error occurred trying to store the extent >> information into the guest's output area." >> >>> + if (!address_space_access_valid(&address_space_memory, addr, length, true, >>> + MEMTXATTRS_UNSPECIFIED)) { >>> + s390_program_interrupt(env, PGM_ADDRESSING, ra); >>> + return; >>> + } >>> + >>> + /* Indicate our initial memory ([0 .. ram_size - 1]) */ >>> + tmp = cpu_to_be64(0); >>> + cpu_physical_memory_write(addr, &tmp, sizeof(tmp)); >>> + tmp = cpu_to_be64(initial_ram_size - 1); >>> + cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp)); >>> + >>> + /* Exactly one entry was stored. */ >>> + env->regs[r3] = 1; >>> + setcc(cpu, 0); >>> +} >>> + >>> int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3) >>> { >>> uint64_t func = env->regs[r1]; >> >> (...) 
>> >>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c >>> index 58dbc023eb..d7274eb320 100644 >>> --- a/target/s390x/misc_helper.c >>> +++ b/target/s390x/misc_helper.c >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) >>> uint64_t r; >>> >>> switch (num) { >>> + case 0x260: >>> + qemu_mutex_lock_iothread(); >>> + handle_diag_260(env, r1, r3, GETPC()); >>> + qemu_mutex_unlock_iothread(); >>> + r = 0; >>> + break; >>> case 0x500: >>> /* KVM hypercall */ >>> qemu_mutex_lock_iothread(); >> >> Looking at the doc referenced above, it seems that we treat every diag >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated >> to your patch; maybe I'm misreading.) > > That's also a BUG in kvm then? > > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu) > { > ... > if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) > return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); > ... > } > But OTOH, it does not sound sane if user space can bypass the OS to yield the CPU ... so this might just be a wrong documentation. All DIAGs should be privileged IIRC.
On 09.07.20 20:15, David Hildenbrand wrote: > On 09.07.20 12:52, Christian Borntraeger wrote: >> >> On 08.07.20 20:51, David Hildenbrand wrote: >>> Let's implement the "storage configuration" part of diag260. This diag >>> is found under z/VM, to indicate usable chunks of memory tot he guest OS. >>> As I don't have access to documentation, I have no clue what the actual >>> error cases are, and which other stuff we could eventually query using this >>> interface. Somebody with access to documentation should fix this. This >>> implementation seems to work with Linux guests just fine. >>> >>> The Linux kernel supports diag260 to query the available memory since >>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM >>> (with maxmem being defined and bigger than the memory size, e.g., "-m >>> 2G,maxmem=4G"), just as if support for SCLP storage information is not >>> implemented. They will fail to detect the actual initial memory size. >>> >>> This interface allows us to expose the maximum ramsize via sclp >>> and the initial ramsize via diag260 - without having to mess with the >>> memory increment size and having to align the initial memory size to it. >>> >>> This is a preparation for memory device support. We'll unlock the >>> implementation with a new QEMU machine that supports memory devices. >>> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >> >> I have not looked into this, so this is purely a question. >> >> Is there a way to hotplug virtio-mem memory beyond the initial size of >> the memory as specified by the initial sclp)? then we could avoid doing >> this platform specfic diag260? > > We need a way to tell the guest about the maximum possible PFN, so it > can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT > tables. On s390x, the only way I see is using a combination of diag260, > without introducing any other new mechanisms. > > Currently Linux selects 3. 
vs 4 level page tables based on that size (I > think that's what you were referring to with the 4TB limit). I can see > that kasan also does some magic based on the value ("populate kasan > shadow for untracked memory"), but did not look into the details. I > *think* kasan will never be able to track that memory, but am not > completely sure. > > I'd like to avoid something as you propose (that's why I searched and > discovered diag260 after all :) ), especially to not silently break in > the future, when other assumptions based on that value are introduced. > > E.g., on my z/VM LinuxOne Community Cloud machine, diag260 gets used as > default, so it does not seem to be a corner case mechanism nowadays. > Note: Reading about diag260 subcode 0xc, we could modify Linux to query the maximum possible pfn via diag260 0xc. Then, we maybe could avoid indicating maxram size via SCLP, and keep diag260-unaware OSs working as before. Thoughts?
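Assuming the doc's wording ("highest addressable byte of virtual storage") maps onto QEMU's notion of the configured maximum, a subcode 0xc handler would plausibly reduce to the following (a guess derived from the quoted description, not a confirmed implementation):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the subcode 0xc idea floated in this thread: report the
 * highest addressable byte, which for a QEMU guest would plausibly be
 * maxram_size - 1.  Illustrative only; whether this mapping is right
 * is exactly what is being discussed.
 */
static uint64_t diag260_subcode_0xc(uint64_t maxram_size)
{
    return maxram_size - 1;
}
```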
On Fri, 10 Jul 2020 10:41:33 +0200 David Hildenbrand <david@redhat.com> wrote: > On 10.07.20 10:32, David Hildenbrand wrote: > > On 09.07.20 12:37, Cornelia Huck wrote: > >> On Wed, 8 Jul 2020 20:51:32 +0200 > >> David Hildenbrand <david@redhat.com> wrote: > >>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c > >>> index 58dbc023eb..d7274eb320 100644 > >>> --- a/target/s390x/misc_helper.c > >>> +++ b/target/s390x/misc_helper.c > >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) > >>> uint64_t r; > >>> > >>> switch (num) { > >>> + case 0x260: > >>> + qemu_mutex_lock_iothread(); > >>> + handle_diag_260(env, r1, r3, GETPC()); > >>> + qemu_mutex_unlock_iothread(); > >>> + r = 0; > >>> + break; > >>> case 0x500: > >>> /* KVM hypercall */ > >>> qemu_mutex_lock_iothread(); > >> > >> Looking at the doc referenced above, it seems that we treat every diag > >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated > >> to your patch; maybe I'm misreading.) > > > > That's also a BUG in kvm then? > > > > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu) > > { > > ... > > if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) > > return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); > > ... > > } > > > > But OTOH, it does not sound sane if user space can bypass the OS to > yield the CPU ... so this might just be a wrong documentation. All DIAGs > should be privileged IIRC. Maybe not all of them, but the diag 0x44 case is indeed odd. No idea what is documented for its use on LPAR (I don't think that document is public.)
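The KVM behaviour quoted above boils down to a single PSW test; sketched standalone (the bit value matches QEMU's PSW_MASK_PSTATE definition, but treat it as an assumption of this sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* problem-state bit in the s390x PSW mask (value as in QEMU's cpu.h) */
#define PSW_MASK_PSTATE 0x0001000000000000ULL

/*
 * Sketch of the check under discussion: KVM (and, per the thread, TCG)
 * treats every DIAG as privileged, so issuing one with the PSW
 * problem-state bit set raises a privileged-operation exception.
 */
static bool diag_raises_privileged_op(uint64_t psw_mask)
{
    return (psw_mask & PSW_MASK_PSTATE) != 0;
}
```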
On 10.07.20 11:17, David Hildenbrand wrote: > On 09.07.20 20:15, David Hildenbrand wrote: >> On 09.07.20 12:52, Christian Borntraeger wrote: >>> >>> On 08.07.20 20:51, David Hildenbrand wrote: >>>> Let's implement the "storage configuration" part of diag260. This diag >>>> is found under z/VM, to indicate usable chunks of memory tot he guest OS. >>>> As I don't have access to documentation, I have no clue what the actual >>>> error cases are, and which other stuff we could eventually query using this >>>> interface. Somebody with access to documentation should fix this. This >>>> implementation seems to work with Linux guests just fine. >>>> >>>> The Linux kernel supports diag260 to query the available memory since >>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM >>>> (with maxmem being defined and bigger than the memory size, e.g., "-m >>>> 2G,maxmem=4G"), just as if support for SCLP storage information is not >>>> implemented. They will fail to detect the actual initial memory size. >>>> >>>> This interface allows us to expose the maximum ramsize via sclp >>>> and the initial ramsize via diag260 - without having to mess with the >>>> memory increment size and having to align the initial memory size to it. >>>> >>>> This is a preparation for memory device support. We'll unlock the >>>> implementation with a new QEMU machine that supports memory devices. >>>> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> >>> I have not looked into this, so this is purely a question. >>> >>> Is there a way to hotplug virtio-mem memory beyond the initial size of >>> the memory as specified by the initial sclp)? then we could avoid doing >>> this platform specfic diag260? >> >> We need a way to tell the guest about the maximum possible PFN, so it >> can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT >> tables. 
On s390x, the only way I see is using a combination of diag260, >> without introducing any other new mechanisms. >> >> Currently Linux selects 3. vs 4 level page tables based on that size (I >> think that's what you were referring to with the 4TB limit). I can see >> that kasan also does some magic based on the value ("populate kasan >> shadow for untracked memory"), but did not look into the details. I >> *think* kasan will never be able to track that memory, but am not >> completely sure. >> >> I'd like to avoid something as you propose (that's why I searched and >> discovered diag260 after all :) ), especially to not silently break in >> the future, when other assumptions based on that value are introduced. >> >> E.g., on my z/VM LinuxOne Community Cloud machine, diag260 gets used as >> default, so it does not seem to be a corner case mechanism nowadays. >> > > Note: Reading about diag260 subcode 0xc, we could modify Linux to query > the maximum possible pfn via diag260 0xc. Then, we maybe could avoid > indicating maxram size via SCLP, and keep diag260-unaware OSs keep > working as before. Thoughts? Implemented it, seems to work fine.
On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: > > Note: Reading about diag260 subcode 0xc, we could modify Linux to query > > the maximum possible pfn via diag260 0xc. Then, we maybe could avoid > > indicating maxram size via SCLP, and keep diag260-unaware OSs keep > > working as before. Thoughts? > > Implemented it, seems to work fine. The returned value would not include standby/reserved memory within z/VM. So this seems not to work. Also: why do you want to change this?
On 10.07.20 17:18, Heiko Carstens wrote: > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid >>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep >>> working as before. Thoughts? >> >> Implemented it, seems to work fine. > > The returned value would not include standby/reserved memory within > z/VM. So this seems not to work. Which value exactly are you referencing? diag 0xc returns two values. One of them seems to do exactly what we need. See https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7 for my current Linux approach. > Also: why do you want to change this? Which change exactly do you mean? If we limit the value returned via SCLP to initial memory, we cannot break any guest (e.g., Linux pre 4.20, kvm-unit-tests). diag260 is then purely optional.
On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote: > On 10.07.20 17:18, Heiko Carstens wrote: > > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: > >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query > >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid > >>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep > >>> working as before. Thoughts? > >> > >> Implemented it, seems to work fine. > > > > The returned value would not include standby/reserved memory within > > z/VM. So this seems not to work. > > Which value exactly are you referencing? diag 0xc returns two values. > One of them seems to do exactly what we need. Maybe I'm missing something as usual, but to me this -------- Usage Notes: ... 2. If the RESERVED or STANDBY option was used on the DEFINE STORAGE command to configure reserved or standby storage for a guest, the values returned in Rx and Ry will be the current values, but these values can change dynamically depending on the options specified and any dynamic storage reconfiguration (DSR) changes initiated by the guest. -------- reads like it is not doing what you want. That is: it does *not* include standby memory and therefore will not return the highest possible pfn.
> Am 10.07.2020 um 17:43 schrieb Heiko Carstens <hca@linux.ibm.com>: > > On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote: >>> On 10.07.20 17:18, Heiko Carstens wrote: >>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: >>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query >>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid >>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep >>>>> working as before. Thoughts? >>>> >>>> Implemented it, seems to work fine. >>> >>> The returned value would not include standby/reserved memory within >>> z/VM. So this seems not to work. >> >> Which value exactly are you referencing? diag 0xc returns two values. >> One of them seems to do exactly what we need. > > Maybe I'm missing something as usual, but to me this > -------- > Usage Notes: > ... > 2. If the RESERVED or STANDBY option was used on the DEFINE STORAGE > command to configure reserved or standby storage for a guest, the > values returned in Rx and Ry will be the current values, but these > values can change dynamically depending on the options specified and > any dynamic storage reconfiguration (DSR) changes initiated by the > guest. > -------- > reads like it is not doing what you want. That is: it does *not* > include standby memory and therefore will not return the highest > possible pfn. > Ah, yes. See the kernel patch, I take the max of both (SCLP, diag260(0xc)) values. Anyhow, what would be your recommendation? Thanks!
On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote: > On 10.07.20 17:18, Heiko Carstens wrote: > > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: > >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query > >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid > >>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep > >>> working as before. Thoughts? > >> > >> Implemented it, seems to work fine. > > > > The returned value would not include standby/reserved memory within > > z/VM. So this seems not to work. > > Which value exactly are you referencing? diag 0xc returns two values. > One of them seems to do exactly what we need. > > See > https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7 > > for my current Linux approach. > > > Also: why do you want to change this > > Which change exactly do you mean? > > If we limit the value returned via SCLP to initial memory, we cannot > break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then > purely optional. Ok, now I see the context. Christian added my just to cc on this specific patch. So if I understand you correctly, then you want to use diag 260 in order to figure out how much memory is _potentially_ available for a guest? This does not fit to the current semantics, since diag 260 returns the address of the highest *currently* accessible address. That is: it does explicitly *not* include standby memory or anything else that might potentially be there. So you would need a different interface to tell the guest about your new hotplug memory interface. If sclp does not work, then maybe a new diagnose(?).
> Am 13.07.2020 um 11:12 schrieb Heiko Carstens <hca@linux.ibm.com>: > > On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote: >>> On 10.07.20 17:18, Heiko Carstens wrote: >>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: >>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query >>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid >>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep >>>>> working as before. Thoughts? >>>> >>>> Implemented it, seems to work fine. >>> >>> The returned value would not include standby/reserved memory within >>> z/VM. So this seems not to work. >> >> Which value exactly are you referencing? diag 0xc returns two values. >> One of them seems to do exactly what we need. >> >> See >> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7 >> >> for my current Linux approach. >> >>> Also: why do you want to change this >> >> Which change exactly do you mean? >> >> If we limit the value returned via SCLP to initial memory, we cannot >> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then >> purely optional. > > Ok, now I see the context. Christian added my just to cc on this > specific patch. I tried to Cc you an all patches but the mail bounced with unknown address (maybe I messed up). > So if I understand you correctly, then you want to use diag 260 in > order to figure out how much memory is _potentially_ available for a > guest? Yes, exactly. > > This does not fit to the current semantics, since diag 260 returns the > address of the highest *currently* accessible address. That is: it > does explicitly *not* include standby memory or anything else that > might potentially be there. The confusing part is that it talks about „adressible“ and not „accessible“. Now that I understood the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory. 
I agree that reusing that interface might not be what we want. It just seemed too easy to avoid creating something new :) > > So you would need a different interface to tell the guest about your > new hotplug memory interface. If sclp does not work, then maybe a new > diagnose(?). > Yes, I think a new Diagnose makes sense. I'll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?
On 13.07.20 12:27, David Hildenbrand wrote: > > >> Am 13.07.2020 um 11:12 schrieb Heiko Carstens <hca@linux.ibm.com>: >> >> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote: >>>> On 10.07.20 17:18, Heiko Carstens wrote: >>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: >>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query >>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid >>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep >>>>>> working as before. Thoughts? >>>>> >>>>> Implemented it, seems to work fine. >>>> >>>> The returned value would not include standby/reserved memory within >>>> z/VM. So this seems not to work. >>> >>> Which value exactly are you referencing? diag 0xc returns two values. >>> One of them seems to do exactly what we need. >>> >>> See >>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7 >>> >>> for my current Linux approach. >>> >>>> Also: why do you want to change this >>> >>> Which change exactly do you mean? >>> >>> If we limit the value returned via SCLP to initial memory, we cannot >>> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then >>> purely optional. >> >> Ok, now I see the context. Christian added my just to cc on this >> specific patch. > > I tried to Cc you an all patches but the mail bounced with unknown address (maybe I messed up). > >> So if I understand you correctly, then you want to use diag 260 in >> order to figure out how much memory is _potentially_ available for a >> guest? > > Yes, exactly. > >> >> This does not fit to the current semantics, since diag 260 returns the >> address of the highest *currently* accessible address. That is: it >> does explicitly *not* include standby memory or anything else that >> might potentially be there. > > The confusing part is that it talks about „adressible“ and not „accessible“. 
Now that I understood the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory. > > I agree that reusing that interface might not be what we want. I just seemed too easy to avoid creating something new :) > >> >> So you would need a different interface to tell the guest about your >> new hotplug memory interface. If sclp does not work, then maybe a new >> diagnose(?). >> > > Yes, I think a new Diagnose makes sense. I‘ll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?> > > Wouldn't sclp be the right thing to provide the max increment number? (and thus the max memory address) > And then (when I got the discussion right) use diag 260 to get the _current_ value.
On 10.07.20 10:32, David Hildenbrand wrote: >>> --- a/target/s390x/misc_helper.c >>> +++ b/target/s390x/misc_helper.c >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) >>> uint64_t r; >>> >>> switch (num) { >>> + case 0x260: >>> + qemu_mutex_lock_iothread(); >>> + handle_diag_260(env, r1, r3, GETPC()); >>> + qemu_mutex_unlock_iothread(); >>> + r = 0; >>> + break; >>> case 0x500: >>> /* KVM hypercall */ >>> qemu_mutex_lock_iothread(); >> >> Looking at the doc referenced above, it seems that we treat every diag >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated >> to your patch; maybe I'm misreading.) > > That's also a BUG in kvm then? > > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu) > { > ... > if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) > return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); > ... > } diag 44 gives a PRIVOP on LPAR, so I think this is fine.
On Mon, 13 Jul 2020 13:54:41 +0200 Christian Borntraeger <borntraeger@de.ibm.com> wrote: > On 10.07.20 10:32, David Hildenbrand wrote: > > >>> --- a/target/s390x/misc_helper.c > >>> +++ b/target/s390x/misc_helper.c > >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) > >>> uint64_t r; > >>> > >>> switch (num) { > >>> + case 0x260: > >>> + qemu_mutex_lock_iothread(); > >>> + handle_diag_260(env, r1, r3, GETPC()); > >>> + qemu_mutex_unlock_iothread(); > >>> + r = 0; > >>> + break; > >>> case 0x500: > >>> /* KVM hypercall */ > >>> qemu_mutex_lock_iothread(); > >> > >> Looking at the doc referenced above, it seems that we treat every diag > >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated > >> to your patch; maybe I'm misreading.) > > > > That's also a BUG in kvm then? > > > > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu) > > { > > ... > > if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) > > return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); > > ... > > } > > diag 44 gives a PRIVOP on LPAR, so I think this is fine. > Seems like a bug/inconsistency in CP (or its documentation), then.
On 13.07.20 14:11, Cornelia Huck wrote: > On Mon, 13 Jul 2020 13:54:41 +0200 > Christian Borntraeger <borntraeger@de.ibm.com> wrote: > >> On 10.07.20 10:32, David Hildenbrand wrote: >> >>>>> --- a/target/s390x/misc_helper.c >>>>> +++ b/target/s390x/misc_helper.c >>>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num) >>>>> uint64_t r; >>>>> >>>>> switch (num) { >>>>> + case 0x260: >>>>> + qemu_mutex_lock_iothread(); >>>>> + handle_diag_260(env, r1, r3, GETPC()); >>>>> + qemu_mutex_unlock_iothread(); >>>>> + r = 0; >>>>> + break; >>>>> case 0x500: >>>>> /* KVM hypercall */ >>>>> qemu_mutex_lock_iothread(); >>>> >>>> Looking at the doc referenced above, it seems that we treat every diag >>>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated >>>> to your patch; maybe I'm misreading.) >>> >>> That's also a BUG in kvm then? >>> >>> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu) >>> { >>> ... >>> if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) >>> return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); >>> ... >>> } >> >> diag 44 gives a PRIVOP on LPAR, so I think this is fine. >> > > Seems like a bug/inconsistency in CP (or its documentation), then. Yes. .globl main main: diag 0,0,0x44 svc 1 also crashes under z/VM with an illegal op.
On 13.07.20 13:08, Christian Borntraeger wrote: > On 13.07.20 12:27, David Hildenbrand wrote: >> >> >>> Am 13.07.2020 um 11:12 schrieb Heiko Carstens <hca@linux.ibm.com>: >>> >>> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote: >>>>> On 10.07.20 17:18, Heiko Carstens wrote: >>>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote: >>>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query >>>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid >>>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep >>>>>>> working as before. Thoughts? >>>>>> >>>>>> Implemented it, seems to work fine. >>>>> >>>>> The returned value would not include standby/reserved memory within >>>>> z/VM. So this seems not to work. >>>> >>>> Which value exactly are you referencing? diag 0xc returns two values. >>>> One of them seems to do exactly what we need. >>>> >>>> See >>>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7 >>>> >>>> for my current Linux approach. >>>> >>>>> Also: why do you want to change this >>>> >>>> Which change exactly do you mean? >>>> >>>> If we limit the value returned via SCLP to initial memory, we cannot >>>> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then >>>> purely optional. >>> >>> Ok, now I see the context. Christian added my just to cc on this >>> specific patch. >> >> I tried to Cc you an all patches but the mail bounced with unknown address (maybe I messed up). >> >>> So if I understand you correctly, then you want to use diag 260 in >>> order to figure out how much memory is _potentially_ available for a >>> guest? >> >> Yes, exactly. >> >>> >>> This does not fit to the current semantics, since diag 260 returns the >>> address of the highest *currently* accessible address. That is: it >>> does explicitly *not* include standby memory or anything else that >>> might potentially be there. 
>> >> The confusing part is that it talks about „adressible“ and not „accessible“. Now that I understood the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory. >> >> I agree that reusing that interface might not be what we want. I just seemed too easy to avoid creating something new :) >> >>> >>> So you would need a different interface to tell the guest about your >>> new hotplug memory interface. If sclp does not work, then maybe a new >>> diagnose(?). >>> >> >> Yes, I think a new Diagnose makes sense. I‘ll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?> > > Wouldnt sclp be the right thing to provide the max increment number? (and thus the max memory address) > And then (when I got the discussion right) use diag 260 to get the _current_ value. So, in summary, we want to indicate to the guest a memory region that will be used to place memory devices ("device memory region"). The region might have holes and the memory within this region might have different semantics than ordinary system memory. Memory that belongs to memory devices should only be detected+used if the guest OS has support for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest (e.g., no virtio-mem driver) should not accidentally make use of such memory. We need a way to a) Tell the guest about boot memory (currently ram_size) b) Tell the guest about the maximum possible ram address, including device memory. (We could also indicate the special "device memory region" explicitly) AFAIK, we have three options: 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10) This is what this series (RFCv1) does. Advantages: - No need for a new diag. No need for memory sensing kernel changes. Disadvantages - Older guests without support for diag260 (<v4.20, kvm-unit-tests) will assume all memory is accessible. Bad.
- The semantics of the value returned in ry via diag260(0xc) is somewhat unclear. Should we return the end address of the highest memory device? OTOH, an unmodified guest OS (without support for memory devices) should not have to care at all about any such memory. - If we ever want to also support standby memory, we might be in trouble. (see below) 2. Indicate ram_size via SCLP, indicate device memory region (currently maxram_size) via new DIAG Advantages: - Unmodified guests won't use/sense memory belonging to memory devices. - We can later have standby memory + memory devices co-exist. Disadvantages - Need a new DIAG. 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby memory) I did not look into the details, because -ENODOCUMENTATION. At least we would run into some alignment issues (again, having to align ram_size/maxram_size to storage increments - which would no longer be 1MB). We would run into issues later, trying to also support standby memory. I guess 1) would mostly work, one just has to run a suitable guest inside the VM. This is no different to running under z/VM where querying diag260 is required. The nice thing about 2) would be, that we can easily implement standby memory. Something like: -m 2G,maxram_size=20G,standbyram_size=4G [ 2G boot RAM ][ 4G standby RAM ][ 14G device memory ] ^ via SCLP maximum increment ^ via new DIAG
On Wed, Jul 15, 2020 at 11:42:37AM +0200, David Hildenbrand wrote: > So, in summary, we want to indicate to the guest a memory region that > will be used to place memory devices ("device memory region"). The > region might have holes and the memory within this region might have > different semantics than ordinary system memory. Memory that belongs to > memory devices should only be detected+used if the guest OS has support > for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest > (e.g., no virtio-mem driver) should not accidentally make use of such > memory. > > We need a way to > a) Tell the guest about boot memory (currently ram_size) > b) Tell the guest about the maximum possible ram address, including > device memory. (We could also indicate the special "device memory > region" explicitly) > > AFAIK, we have three options: > > 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10) > > This is what this series (RFCv1 does). > > Advantages: > - No need for a new diag. No need for memory sensing kernel changes. > Disadvantages > - Older guests without support for diag260 (<v4.2, kvm-unit-tests) will > assume all memory is accessible. Bad. Why would old guests assume that? At least in v4.1 the kernel will calculate the max address by using increment size * increment number and then test if *each* increment is available with tprot. > - The semantics of the value returned in ry via diag260(0xc) is somewhat > unclear. Should we return the end address of the highest memory > device? OTOH, an unmodified guest OS (without support for memory > devices) should not have to care at all about any such memory. I'm confused. The kernel currently only uses diag260(0x10). How is diag260(0xc) relevant here? > 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby > memory) > > I did not look into the details, because -ENODOCUMENTATION. 
At least we > would run into some alignment issues (again, having to align > ram_size/maxram_size to storage increments - which would no longer be > 1MB). We would run into issues later, trying to also support standby memory. That doesn't make sense to me: either support memory hotplug via sclp/standby memory, or with your new method. But trying to support both.. what's the use case?
On 15.07.20 12:43, Heiko Carstens wrote: > On Wed, Jul 15, 2020 at 11:42:37AM +0200, David Hildenbrand wrote: >> So, in summary, we want to indicate to the guest a memory region that >> will be used to place memory devices ("device memory region"). The >> region might have holes and the memory within this region might have >> different semantics than ordinary system memory. Memory that belongs to >> memory devices should only be detected+used if the guest OS has support >> for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest >> (e.g., no virtio-mem driver) should not accidentally make use of such >> memory. >> >> We need a way to >> a) Tell the guest about boot memory (currently ram_size) >> b) Tell the guest about the maximum possible ram address, including >> device memory. (We could also indicate the special "device memory >> region" explicitly) >> >> AFAIK, we have three options: >> >> 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10) >> >> This is what this series (RFCv1 does). >> >> Advantages: >> - No need for a new diag. No need for memory sensing kernel changes. >> Disadvantages >> - Older guests without support for diag260 (<v4.2, kvm-unit-tests) will >> assume all memory is accessible. Bad. > > Why would old guests assume that? > > At least in v4.1 the kernel will calculate the max address by using > increment size * increment number and then test if *each* increment is > available with tprot. Yes, we do the same in kvm-unit-tests. But it's not sufficient for memory devices. Just because a tprot succeed (for memory belonging to a memory device) does not mean the kernel should silently start to use that memory. Note: memory devices are not just DIMMs that can be mapped to storage increments. The memory might have completely different semantics, that's why they are glued to a managing virtio device. 
For example: a tprot might succeed on a memory region provided by virtio-mem, this does, however, not mean that the memory can (and should) be used by the guest. > >> - The semantics of the value returned in ry via diag260(0xc) is somewhat >> unclear. Should we return the end address of the highest memory >> device? OTOH, an unmodified guest OS (without support for memory >> devices) should not have to care at all about any such memory. > > I'm confused. The kernel currently only uses diag260(0x10). How is > diag260(0xc) relevant here? We have to implement diag260(0x10) if we implement diag260(0xc), no? Or can we simply throw a specification exception? > >> 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby >> memory) >> >> I did not look into the details, because -ENODOCUMENTATION. At least we >> would run into some alignment issues (again, having to align >> ram_size/maxram_size to storage increments - which would no longer be >> 1MB). We would run into issues later, trying to also support standby memory. > > That doesn't make sense to me: either support memory hotplug via > sclp/standby memory, or with your new method. But trying to support > both.. what's the use case? Not sure if there is any, it just feels cleaner to me to separate the architectured (sclp memory/reserved/standby) bits that specify a semantic when used via rnmax+tprot from QEMU specific memory ranges that have special semantics. virtio-mem is only one type of a virtio-based memory device. In the future we might want to have virtio-pmem, but there might be more ...
On Wed, Jul 15, 2020 at 01:21:06PM +0200, David Hildenbrand wrote: > > At least in v4.1 the kernel will calculate the max address by using > > increment size * increment number and then test if *each* increment is > > available with tprot. > > Yes, we do the same in kvm-unit-tests. But it's not sufficient for > memory devices. > > Just because a tprot succeed (for memory belonging to a memory device) > does not mean the kernel should silently start to use that memory. > > Note: memory devices are not just DIMMs that can be mapped to storage > increments. The memory might have completely different semantics, that's > why they are glued to a managing virtio device. > > For example: a tprot might succeed on a memory region provided by > virtio-mem, this does, however, not mean that the memory can (and > should) be used by the guest. So, are you saying that even at IPL time there might already be memory devices attached to the system? And the kernel should _not_ treat them as normal memory?
On 15.07.20 13:34, Heiko Carstens wrote: > On Wed, Jul 15, 2020 at 01:21:06PM +0200, David Hildenbrand wrote: >>> At least in v4.1 the kernel will calculate the max address by using >>> increment size * increment number and then test if *each* increment is >>> available with tprot. >> >> Yes, we do the same in kvm-unit-tests. But it's not sufficient for >> memory devices. >> >> Just because a tprot succeed (for memory belonging to a memory device) >> does not mean the kernel should silently start to use that memory. >> >> Note: memory devices are not just DIMMs that can be mapped to storage >> increments. The memory might have completely different semantics, that's >> why they are glued to a managing virtio device. >> >> For example: a tprot might succeed on a memory region provided by >> virtio-mem, this does, however, not mean that the memory can (and >> should) be used by the guest. > > So, are you saying that even at IPL time there might already be memory > devices attached to the system? And the kernel should _not_ treat them > as normal memory? Sorry if that was unclear. Yes, we can have such devices (including memory areas) on a cold boot/reboot/kexec. In addition, they might pop up at runtime (e.g., hotplugging a virtio-mem device). The device is in charge of exposing that area and deciding what to do with it. The kernel should never treat them as normal memory (IOW, system RAM). Not during a cold boot, not during a reboot. The device driver is responsible for deciding how to use that memory (e.g., add it as system RAM), and which parts of that memory are actually valid to be used (even if a tprot might succeed it might not be valid to use just yet - I guess somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't want to use it like normal memory). E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never exposed via the e820 map.
The only trace that there might be *something* now/in the future is indicated via ACPI SRAT tables. This takes currently care of indicating the maximum possible PFN.
On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote: > > So, are you saying that even at IPL time there might already be memory > > devices attached to the system? And the kernel should _not_ treat them > > as normal memory? > > Sorry if that was unclear. Yes, we can have such devices (including > memory areas) on a cold boot/reboot/kexec. In addition, they might pop > up at runtime (e.g., hotplugging a virtio-mem device). The device is in > charge of exposing that area and deciding what to do with it. > > The kernel should never treat them as normal memory (IOW, system RAM). > Not during a cold boot, not during a reboot. The device driver is > responsible for deciding how to use that memory (e.g., add it as system > RAM), and which parts of that memory are actually valid to be used (even > if a tprot might succeed it might not be valid to use just yet - I guess > somewhat similar to doing a tport on a dcss area - AFAIK, you also don't > want to use it like normal memory). > > E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never > exposed via the e820 map. The only trace that there might be *something* > now/in the future is indicated via ACPI SRAT tables. This takes > currently care of indicating the maximum possible PFN. Ok, but all of this needs to be documented somewhere. This raises a couple of questions to me: What happens on - IPL Clear with this special memory? Will it be detached/away afterwards? - IPL Normal? "Obviously" it must stay otherwise kdump would never see that memory. And when you write it's up to the device driver what to do with that memory: is there any documentation available what all of this is good for? I would assume _most likely_ this extra memory is going to be added to ZONE_MOVABLE _somehow_ so that it can be taken away also. But since it is not normal memory, like you say, I'm wondering how that is supposed to work.
As far as I can tell there would be a lot of inconsistencies in userspace interfaces which provide memory / zone information. Or I'm not getting the point of all of this at all. So please provide more information, or a pointer to documentation.
On 15.07.20 18:14, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
>>> So, are you saying that even at IPL time there might already be memory
>>> devices attached to the system? And the kernel should _not_ treat them
>>> as normal memory?
>>
>> Sorry if that was unclear. Yes, we can have such devices (including
>> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
>> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
>> charge of exposing that area and deciding what to do with it.
>>
>> The kernel should never treat them as normal memory (IOW, system RAM).
>> Not during a cold boot, not during a reboot. The device driver is
>> responsible for deciding how to use that memory (e.g., add it as system
>> RAM), and which parts of that memory are actually valid to be used (even
>> if a tprot might succeed it might not be valid to use just yet - I guess
>> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
>> want to use it like normal memory).
>>
>> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
>> exposed via the e820 map. The only trace that there might be *something*
>> now/in the future is indicated via ACPI SRAT tables. This currently
>> takes care of indicating the maximum possible PFN.
>
> Ok, but all of this needs to be documented somewhere. This raises a
> couple of questions to me:

I assume this mostly targets virtio-mem, because the semantics of
virtio-mem provided memory are extra-weird (in contrast to the rather
static virtio-pmem, which is essentially just an emulated NVDIMM - a
disk mapped into physical memory).

Regarding documentation (some linked in the cover letter), so far I have
(generic/x86-64)

1. https://virtio-mem.gitlab.io/
2. virtio spec proposal [1]
3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
5. Linux cover letter [2]
6. KVM forum talk [3] [4]

As your questions go quite into technical detail, and I don't feel like
rewriting the doc here :) , I suggest looking at [2], 1, and 5.

> What happens on

I'll stick to virtio-mem when answering regarding "special memory". As I
noted, there might be more in the future.

> - IPL Clear with this special memory? Will it be detached/away afterwards?

A diag308(0x3) - load clear - will usually* zap all virtio-mem provided
memory (discard backing storage in the hypervisor) and logically turn
the state of all virtio-mem memory inside the device-assigned memory
region to "unplugged" - just as during a cold boot. The semantics of
"unplugged" blocks depend on the "usable region" (see the virtio spec if
you're curious - the memory might still be accessible). Starting "fresh"
with all memory logically unplugged is part of the way virtio-mem works.

* there are corner cases while a VM is getting migrated, where we cannot
perform this (similar to us not being able to clear ordinary memory
during a load clear in QEMU while migrating). In this case, the memory
is left untouched.

> - IPL Normal? "Obviously" it must stay otherwise kdump would never see
>   that memory.

Only diag308(0x3) will mess with virtio-mem memory. For the other types
of resets, it's left untouched. So yes, "obviously" is correct :)

> And when you write it's up to the device driver what to do with that
> memory: is there any documentation available what all of this is good
> for? I would assume _most likely_ this extra memory is going to be
> added to ZONE_MOVABLE _somehow_ so that it can be taken away also. But
> since it is not normal memory, like you say, I'm wondering how that is
> supposed to work.

For now

1. virtio-mem adds all (possible) aligned memory via add_memory() to Linux
2. Requires user space to online the memory blocks / configure a zone.

For 2., only ZONE_NORMAL really works right now and is recommended to
use. As you correctly note, that does not give you any guarantees how
much memory you can unplug again (e.g., fragmentation with unmovable
data), but is good enough for the first version (with focus on memory
hotplug, not unplug). ZONE_MOVABLE support is in the works. However, we
cannot blindly expose all memory to ZONE_MOVABLE (zone imbalances
leading to crashes), and sometimes also don't want to (e.g., gigantic
pages). Without spoilering too much, a mixture would be nice.

> As far as I can tell there would be a lot of inconsistencies in
> userspace interfaces which provide memory / zone information. Or I'm
> not getting the point of all of this at all.

All memory/zone stats are properly fixed up (similar to ballooning). The
only visible inconsistency that *might* happen when unplugging memory /
hotplugging memory in <256MB on s390x, is that the number of memory
block devices (/sys/devices/system/memory/...) might indicate more
memory than actually available (e.g., via lsmem).

[1] https://lists.oasis-open.org/archives/virtio-comment/202006/msg00012.html
[2] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com/
[3] https://events19.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
[4] https://www.youtube.com/watch?v=H65FDUDPu9s
On 15.07.20 19:38, David Hildenbrand wrote:
> On 15.07.20 18:14, Heiko Carstens wrote:
>> On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
[...]
>> Ok, but all of this needs to be documented somewhere. This raises a
>> couple of questions to me:
>
> I assume this mostly targets virtio-mem, because the semantics of
> virtio-mem provided memory are extra-weird (in contrast to the rather
> static virtio-pmem, which is essentially just an emulated NVDIMM - a
> disk mapped into physical memory).
>
> Regarding documentation (some linked in the cover letter), so far I have
> (generic/x86-64)
>
> 1. https://virtio-mem.gitlab.io/
> 2. virtio spec proposal [1]
> 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
> 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
> 5. Linux cover letter [2]
> 6. KVM forum talk [3] [4]
>
> As your questions go quite into technical detail, and I don't feel like
> rewriting the doc here :) , I suggest looking at [2], 1, and 5.

Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
comparison to memory ballooning (and DIMM-based memory hotplug).

> [3] https://events19.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
> [4] https://www.youtube.com/watch?v=H65FDUDPu9s
On Wed, Jul 15, 2020 at 07:51:27PM +0200, David Hildenbrand wrote:
> > Regarding documentation (some linked in the cover letter), so far I have
> > (generic/x86-64)
> >
> > 1. https://virtio-mem.gitlab.io/
> > 2. virtio spec proposal [1]
> > 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
> > 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
> > 5. Linux cover letter [2]
> > 6. KVM forum talk [3] [4]
> >
> > As your questions go quite into technical detail, and I don't feel like
> > rewriting the doc here :) , I suggest looking at [2], 1, and 5.
>
> Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
> comparison to memory ballooning (and DIMM-based memory hotplug).

Ok, thanks for the pointers!

So I would go for what you suggested with option 2: provide a new
diagnose which tells the kernel where the memory device area is
(probably just start + size?), and leave all other interfaces alone.
This looks to me like by far the "cleanest" solution which does not
add semantics to existing interfaces, where it is questionable if this
wouldn't cause problems in the future.
On 20.07.20 16:43, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 07:51:27PM +0200, David Hildenbrand wrote:
>>> Regarding documentation (some linked in the cover letter), so far I have
>>> (generic/x86-64)
[...]
>>
>> Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
>> comparison to memory ballooning (and DIMM-based memory hotplug).
>
> Ok, thanks for the pointers!

Thanks for having a look. Once the s390x part is in good shape, I'll add
proper documentation (+ spec updates regarding exact system reset
handling on s390x).

> So I would go for what you suggested with option 2: provide a new
> diagnose which tells the kernel where the memory device area is
> (probably just start + size?), and leave all other interfaces alone.

Ha, that's precisely what I hacked previously today :) Have a new
diag500 ("KVM hypercall") subcode (4) to give start+size of the area
reserved for memory devices. Will send a new RFC this week to showcase
how it would look like.

> This looks to me like by far the "cleanest" solution which does not
> add semantics to existing interfaces, where it is questionable if this
> wouldn't cause problems in the future.

Yes, same thoughts over here!
diff --git a/target/s390x/diag.c b/target/s390x/diag.c
index 1a48429564..c3b1e24b2c 100644
--- a/target/s390x/diag.c
+++ b/target/s390x/diag.c
@@ -23,6 +23,63 @@
 #include "hw/s390x/pv.h"
 #include "kvm_s390x.h"
 
+void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    const ram_addr_t initial_ram_size = ms->ram_size;
+    const uint64_t subcode = env->regs[r3];
+    S390CPU *cpu = env_archcpu(env);
+    ram_addr_t addr, length;
+    uint64_t tmp;
+
+    /* TODO: Unlock with new QEMU machine. */
+    if (false) {
+        s390_program_interrupt(env, PGM_OPERATION, ra);
+        return;
+    }
+
+    /*
+     * There also seems to be subcode "0xc", which stores the size of the
+     * first chunk and the total size to r1/r2. It's only used by very old
+     * Linux, so don't implement it.
+     */
+    if ((r1 & 1) || subcode != 0x10) {
+        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
+        return;
+    }
+    addr = env->regs[r1];
+    length = env->regs[r1 + 1];
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
+        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
+        return;
+    }
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!length) {
+        setcc(cpu, 3);
+        return;
+    }
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!address_space_access_valid(&address_space_memory, addr, length, true,
+                                    MEMTXATTRS_UNSPECIFIED)) {
+        s390_program_interrupt(env, PGM_ADDRESSING, ra);
+        return;
+    }
+
+    /* Indicate our initial memory ([0 .. ram_size - 1]) */
+    tmp = cpu_to_be64(0);
+    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
+    tmp = cpu_to_be64(initial_ram_size - 1);
+    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
+
+    /* Exactly one entry was stored. */
+    env->regs[r3] = 1;
+    setcc(cpu, 0);
+}
+
 int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
 {
     uint64_t func = env->regs[r1];
diff --git a/target/s390x/internal.h b/target/s390x/internal.h
index b1e0ebf67f..a7a3df9a3b 100644
--- a/target/s390x/internal.h
+++ b/target/s390x/internal.h
@@ -372,6 +372,8 @@ int mmu_translate_real(CPUS390XState *env, target_ulong raddr, int rw,
 
 /* misc_helper.c */
+void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3,
+                     uintptr_t ra);
 int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3);
 void handle_diag_308(CPUS390XState *env, uint64_t r1, uint64_t r3,
                      uintptr_t ra);
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index f2f75d2a57..d6de3ad86c 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -1565,6 +1565,14 @@ static int handle_hypercall(S390CPU *cpu, struct kvm_run *run)
     return ret;
 }
 
+static void kvm_handle_diag_260(S390CPU *cpu, struct kvm_run *run)
+{
+    const uint64_t r1 = (run->s390_sieic.ipa & 0x00f0) >> 4;
+    const uint64_t r3 = run->s390_sieic.ipa & 0x000f;
+
+    handle_diag_260(&cpu->env, r1, r3, 0);
+}
+
 static void kvm_handle_diag_288(S390CPU *cpu, struct kvm_run *run)
 {
     uint64_t r1, r3;
@@ -1614,6 +1622,9 @@ static int handle_diag(S390CPU *cpu, struct kvm_run *run, uint32_t ipb)
      */
     func_code = decode_basedisp_rs(&cpu->env, ipb, NULL) & DIAG_KVM_CODE_MASK;
     switch (func_code) {
+    case 0x260:
+        kvm_handle_diag_260(cpu, run);
+        break;
     case DIAG_TIMEREVENT:
         kvm_handle_diag_288(cpu, run);
         break;
diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
index 58dbc023eb..d7274eb320 100644
--- a/target/s390x/misc_helper.c
+++ b/target/s390x/misc_helper.c
@@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
     uint64_t r;
 
     switch (num) {
+    case 0x260:
+        qemu_mutex_lock_iothread();
+        handle_diag_260(env, r1, r3, GETPC());
+        qemu_mutex_unlock_iothread();
+        r = 0;
+        break;
     case 0x500:
         /* KVM hypercall */
         qemu_mutex_lock_iothread();
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index 4f6f1e31cd..6bb8b6e513 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -2398,6 +2398,10 @@ static DisasJumpType op_diag(DisasContext *s, DisasOps *o)
     TCGv_i32 func_code = tcg_const_i32(get_field(s, i2));
 
     gen_helper_diag(cpu_env, r1, r3, func_code);
+    /* Only some diags modify the CC. */
+    if (get_field(s, i2) == 0x260) {
+        set_cc_static(s);
+    }
 
     tcg_temp_free_i32(func_code);
     tcg_temp_free_i32(r3);
Let's implement the "storage configuration" part of diag260. This diag
is found under z/VM, to indicate usable chunks of memory to the guest OS.
As I don't have access to documentation, I have no clue what the actual
error cases are, and which other stuff we could eventually query using
this interface. Somebody with access to documentation should fix this.
This implementation seems to work with Linux guests just fine.

The Linux kernel supports diag260 to query the available memory since
v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
(with maxmem being defined and bigger than the memory size, e.g., "-m
2G,maxmem=4G"), just as if support for SCLP storage information is not
implemented. They will fail to detect the actual initial memory size.

This interface allows us to expose the maximum ramsize via SCLP and the
initial ramsize via diag260 - without having to mess with the memory
increment size and having to align the initial memory size to it.

This is a preparation for memory device support. We'll unlock the
implementation with a new QEMU machine that supports memory devices.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
 target/s390x/internal.h    |  2 ++
 target/s390x/kvm.c         | 11 ++++++++
 target/s390x/misc_helper.c |  6 ++++
 target/s390x/translate.c   |  4 +++
 5 files changed, 80 insertions(+)