
[RFC,2/5] s390x: implement diag260

Message ID 20200708185135.46694-3-david@redhat.com (mailing list archive)
State New, archived
Series s390x: initial support for virtio-mem

Commit Message

David Hildenbrand July 8, 2020, 6:51 p.m. UTC
Let's implement the "storage configuration" part of diag260. This diag
is found under z/VM, to indicate usable chunks of memory to the guest OS.
As I don't have access to documentation, I have no clue what the actual
error cases are, and which other stuff we could eventually query using this
interface. Somebody with access to documentation should fix this. This
implementation seems to work with Linux guests just fine.

The Linux kernel supports diag260 to query the available memory since
v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
(with maxmem being defined and bigger than the memory size, e.g., "-m
 2G,maxmem=4G"), just as if support for SCLP storage information is not
implemented. They will fail to detect the actual initial memory size.

This interface allows us to expose the maximum ramsize via sclp
and the initial ramsize via diag260 - without having to mess with the
memory increment size and having to align the initial memory size to it.

This is a preparation for memory device support. We'll unlock the
implementation with a new QEMU machine that supports memory devices.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
 target/s390x/internal.h    |  2 ++
 target/s390x/kvm.c         | 11 ++++++++
 target/s390x/misc_helper.c |  6 ++++
 target/s390x/translate.c   |  4 +++
 5 files changed, 80 insertions(+)
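
For context, a minimal sketch of how a guest kernel might invoke this
interface (not part of the patch; the inline-asm details are assumptions,
DIAG is privileged, and real code additionally needs program-check handling
for hypervisors that do not implement diag260):

#include <stdint.h>

/* One storage extent as stored by subcode 0x10: start/end address (incl.). */
struct mem_extent {
	uint64_t start;
	uint64_t end;
};

/*
 * rx (an even register) holds the 16-byte-aligned output buffer address,
 * rx + 1 its length (a positive multiple of 16), ry the subcode. On cc 0,
 * ry returns the number of extents stored.
 */
static int diag260_storage_config(struct mem_extent *buf, uint64_t len,
				  uint64_t *entries)
{
	register uint64_t rx1 asm("2") = (uint64_t)buf;
	register uint64_t rx2 asm("3") = len;
	register uint64_t ry asm("4") = 0x10;
	int cc;

	asm volatile(
		"	diag	%[rx],%[ry],0x260\n"
		"	ipm	%[cc]\n"	/* extract the condition code */
		"	srl	%[cc],28\n"
		: [cc] "=d" (cc), [ry] "+d" (ry), [rx] "+d" (rx1), "+d" (rx2)
		:
		: "cc", "memory");
	*entries = ry;
	return cc;
}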

Comments

Cornelia Huck July 9, 2020, 10:37 a.m. UTC | #1
On Wed,  8 Jul 2020 20:51:32 +0200
David Hildenbrand <david@redhat.com> wrote:

> Let's implement the "storage configuration" part of diag260. This diag
> is found under z/VM, to indicate usable chunks of memory to the guest OS.
> As I don't have access to documentation, I have no clue what the actual
> error cases are, and which other stuff we could eventually query using this
> interface. Somebody with access to documentation should fix this. This
> implementation seems to work with Linux guests just fine.
> 
> The Linux kernel supports diag260 to query the available memory since
> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
> (with maxmem being defined and bigger than the memory size, e.g., "-m
>  2G,maxmem=4G"), just as if support for SCLP storage information is not
> implemented. They will fail to detect the actual initial memory size.
> 
> This interface allows us to expose the maximum ramsize via sclp
> and the initial ramsize via diag260 - without having to mess with the
> memory increment size and having to align the initial memory size to it.
> 
> This is a preparation for memory device support. We'll unlock the
> implementation with a new QEMU machine that supports memory devices.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>  target/s390x/internal.h    |  2 ++
>  target/s390x/kvm.c         | 11 ++++++++
>  target/s390x/misc_helper.c |  6 ++++
>  target/s390x/translate.c   |  4 +++
>  5 files changed, 80 insertions(+)
> 
> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
> index 1a48429564..c3b1e24b2c 100644
> --- a/target/s390x/diag.c
> +++ b/target/s390x/diag.c
> @@ -23,6 +23,63 @@
>  #include "hw/s390x/pv.h"
>  #include "kvm_s390x.h"
>  
> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    const ram_addr_t initial_ram_size = ms->ram_size;
> +    const uint64_t subcode = env->regs[r3];
> +    S390CPU *cpu = env_archcpu(env);
> +    ram_addr_t addr, length;
> +    uint64_t tmp;
> +
> +    /* TODO: Unlock with new QEMU machine. */
> +    if (false) {
> +        s390_program_interrupt(env, PGM_OPERATION, ra);
> +        return;
> +    }
> +
> +    /*
> +     * There also seems to be subcode "0xc", which stores the size of the
> +     * first chunk and the total size to r1/r2. It's only used by very old
> +     * Linux, so don't implement it.

FWIW,
https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
seems to list the available subcodes. Anything but 0xc and 0x10 is for
24/31 bit only, so we can safely ignore them. Not sure what we want to
do with 0xc: it is supposed to "Return the highest addressable byte of
virtual storage in the host-primary address space, including named
saved systems and saved segments", so returning the end of the address
space should be easy enough, but not very useful.

> +     */
> +    if ((r1 & 1) || subcode != 0x10) {
> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
> +        return;
> +    }
> +    addr = env->regs[r1];
> +    length = env->regs[r1 + 1];
> +
> +    /* FIXME: Somebody with documentation should fix this. */

Doc mentioned above says for specification exception:

"For subcode X'10':
• Rx is not an even-numbered register.
• The address contained in Rx is not on a quadword boundary.
• The length contained in Rx+1 is not a positive multiple of 16."

> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
> +        return;
> +    }
> +
> +    /* FIXME: Somebody with documentation should fix this. */
> +    if (!length) {

Probably specification exception as well?

> +        setcc(cpu, 3);
> +        return;
> +    }
> +
> +    /* FIXME: Somebody with documentation should fix this. */

For access exception:

"For subcode X'10', an error occurred trying to store the extent
information into the guest's output area."

> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
> +                                    MEMTXATTRS_UNSPECIFIED)) {
> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
> +        return;
> +    }
> +
> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
> +    tmp = cpu_to_be64(0);
> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
> +    tmp = cpu_to_be64(initial_ram_size - 1);
> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
> +
> +    /* Exactly one entry was stored. */
> +    env->regs[r3] = 1;
> +    setcc(cpu, 0);
> +}
> +
>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>  {
>      uint64_t func = env->regs[r1];

(...)

> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
> index 58dbc023eb..d7274eb320 100644
> --- a/target/s390x/misc_helper.c
> +++ b/target/s390x/misc_helper.c
> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>      uint64_t r;
>  
>      switch (num) {
> +    case 0x260:
> +        qemu_mutex_lock_iothread();
> +        handle_diag_260(env, r1, r3, GETPC());
> +        qemu_mutex_unlock_iothread();
> +        r = 0;
> +        break;
>      case 0x500:
>          /* KVM hypercall */
>          qemu_mutex_lock_iothread();

Looking at the doc referenced above, it seems that we treat every diag
call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
to your patch; maybe I'm misreading.)

> diff --git a/target/s390x/translate.c b/target/s390x/translate.c
> index 4f6f1e31cd..6bb8b6e513 100644
> --- a/target/s390x/translate.c
> +++ b/target/s390x/translate.c
> @@ -2398,6 +2398,10 @@ static DisasJumpType op_diag(DisasContext *s, DisasOps *o)
>      TCGv_i32 func_code = tcg_const_i32(get_field(s, i2));
>  
>      gen_helper_diag(cpu_env, r1, r3, func_code);
> +    /* Only some diags modify the CC. */
> +    if (get_field(s, i2) == 0x260) {
> +        set_cc_static(s);
> +    }
>  
>      tcg_temp_free_i32(func_code);
>      tcg_temp_free_i32(r3);
Christian Borntraeger July 9, 2020, 10:52 a.m. UTC | #2
On 08.07.20 20:51, David Hildenbrand wrote:
> Let's implement the "storage configuration" part of diag260. This diag
> is found under z/VM, to indicate usable chunks of memory to the guest OS.
> As I don't have access to documentation, I have no clue what the actual
> error cases are, and which other stuff we could eventually query using this
> interface. Somebody with access to documentation should fix this. This
> implementation seems to work with Linux guests just fine.
> 
> The Linux kernel supports diag260 to query the available memory since
> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
> (with maxmem being defined and bigger than the memory size, e.g., "-m
>  2G,maxmem=4G"), just as if support for SCLP storage information is not
> implemented. They will fail to detect the actual initial memory size.
> 
> This interface allows us to expose the maximum ramsize via sclp
> and the initial ramsize via diag260 - without having to mess with the
> memory increment size and having to align the initial memory size to it.
> 
> This is a preparation for memory device support. We'll unlock the
> implementation with a new QEMU machine that supports memory devices.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

I have not looked into this, so this is purely a question. 

Is there a way to hotplug virtio-mem memory beyond the initial size of
the memory (as specified by the initial SCLP)? Then we could avoid doing
this platform-specific diag260?
The only issue I see is when we need to go beyond 4TB due to the page table
upgrade in the kernel. 

FWIW diag 260 is publicly documented.
David Hildenbrand July 9, 2020, 5:54 p.m. UTC | #3
On 09.07.20 12:37, Cornelia Huck wrote:
> On Wed,  8 Jul 2020 20:51:32 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's implement the "storage configuration" part of diag260. This diag
>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>> As I don't have access to documentation, I have no clue what the actual
>> error cases are, and which other stuff we could eventually query using this
>> interface. Somebody with access to documentation should fix this. This
>> implementation seems to work with Linux guests just fine.
>>
>> The Linux kernel supports diag260 to query the available memory since
>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>> implemented. They will fail to detect the actual initial memory size.
>>
>> This interface allows us to expose the maximum ramsize via sclp
>> and the initial ramsize via diag260 - without having to mess with the
>> memory increment size and having to align the initial memory size to it.
>>
>> This is a preparation for memory device support. We'll unlock the
>> implementation with a new QEMU machine that supports memory devices.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>>  target/s390x/internal.h    |  2 ++
>>  target/s390x/kvm.c         | 11 ++++++++
>>  target/s390x/misc_helper.c |  6 ++++
>>  target/s390x/translate.c   |  4 +++
>>  5 files changed, 80 insertions(+)
>>
>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>> index 1a48429564..c3b1e24b2c 100644
>> --- a/target/s390x/diag.c
>> +++ b/target/s390x/diag.c
>> @@ -23,6 +23,63 @@
>>  #include "hw/s390x/pv.h"
>>  #include "kvm_s390x.h"
>>  
>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>> +{
>> +    MachineState *ms = MACHINE(qdev_get_machine());
>> +    const ram_addr_t initial_ram_size = ms->ram_size;
>> +    const uint64_t subcode = env->regs[r3];
>> +    S390CPU *cpu = env_archcpu(env);
>> +    ram_addr_t addr, length;
>> +    uint64_t tmp;
>> +
>> +    /* TODO: Unlock with new QEMU machine. */
>> +    if (false) {
>> +        s390_program_interrupt(env, PGM_OPERATION, ra);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * There also seems to be subcode "0xc", which stores the size of the
>> +     * first chunk and the total size to r1/r2. It's only used by very old
>> +     * Linux, so don't implement it.
> 
> FWIW,
> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
> seems to list the available subcodes. Anything but 0xc and 0x10 is for
> 24/31 bit only, so we can safely ignore them. Not sure what we want to
> do with 0xc: it is supposed to "Return the highest addressable byte of
> virtual storage in the host-primary address space, including named
> saved systems and saved segments", so returning the end of the address
> space should be easy enough, but not very useful.

Thanks for the link to the documentation! Either my google search skills
are bad or that stuff is just hard to find :) I'll have a look and see
how to make sense of 0xc. Smells like "maxram_size - 1" indeed.
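
(For illustration, a sketch of what a 0xc handler in handle_diag_260() might
look like under that guess, placed before the subcode != 0x10 check; which of
the two documented return values ends up in which register is an assumption,
and as the discussion below shows, the z/VM semantics turn out to be more
subtle than "maxram_size - 1":)

    if (subcode == 0xc) {
        /* Sketch only: highest addressable byte of the address space. */
        env->regs[r1] = ms->maxram_size - 1;
        /* Second documented return value; exact meaning assumed here. */
        env->regs[r3] = ms->maxram_size - 1;
        setcc(cpu, 0);
        return;
    }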

> 
>> +     */
>> +    if ((r1 & 1) || subcode != 0x10) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +    addr = env->regs[r1];
>> +    length = env->regs[r1 + 1];
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> Doc mentioned above says for specification exception:
> 
> "For subcode X'10':
> • Rx is not an even-numbered register.
> • The address contained in Rx is not on a quadword boundary.
> • The length contained in Rx+1 is not a positive multiple of 16."
> 
>> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
>> +    if (!length) {
> 
> Probably specification exception as well?

Yeah I'll add "|| !length" above.
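
I.e., folding the zero-length case into the specification exception; a sketch
of the resulting check:

    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16) ||
        !length) {
        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
        return;
    }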

> 
>> +        setcc(cpu, 3);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> For access exception:
> 
> "For subcode X'10', an error occurred trying to store the extent
> information into the guest's output area."
> 

Okay, looks good then!

>> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
>> +                                    MEMTXATTRS_UNSPECIFIED)) {
>> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
>> +        return;
>> +    }
>> +
>> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
>> +    tmp = cpu_to_be64(0);
>> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
>> +    tmp = cpu_to_be64(initial_ram_size - 1);
>> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
>> +
>> +    /* Exactly one entry was stored. */
>> +    env->regs[r3] = 1;
>> +    setcc(cpu, 0);
>> +}
>> +
>>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>>  {
>>      uint64_t func = env->regs[r1];
> 
> (...)
> 
>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
>> index 58dbc023eb..d7274eb320 100644
>> --- a/target/s390x/misc_helper.c
>> +++ b/target/s390x/misc_helper.c
>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>      uint64_t r;
>>  
>>      switch (num) {
>> +    case 0x260:
>> +        qemu_mutex_lock_iothread();
>> +        handle_diag_260(env, r1, r3, GETPC());
>> +        qemu_mutex_unlock_iothread();
>> +        r = 0;
>> +        break;
>>      case 0x500:
>>          /* KVM hypercall */
>>          qemu_mutex_lock_iothread();
> 
> Looking at the doc referenced above, it seems that we treat every diag
> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> to your patch; maybe I'm misreading.)

Interesting. Adding it to my todo list.

Thanks again!
David Hildenbrand July 9, 2020, 6:15 p.m. UTC | #4
On 09.07.20 12:52, Christian Borntraeger wrote:
> 
> On 08.07.20 20:51, David Hildenbrand wrote:
>> Let's implement the "storage configuration" part of diag260. This diag
>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>> As I don't have access to documentation, I have no clue what the actual
>> error cases are, and which other stuff we could eventually query using this
>> interface. Somebody with access to documentation should fix this. This
>> implementation seems to work with Linux guests just fine.
>>
>> The Linux kernel supports diag260 to query the available memory since
>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>> implemented. They will fail to detect the actual initial memory size.
>>
>> This interface allows us to expose the maximum ramsize via sclp
>> and the initial ramsize via diag260 - without having to mess with the
>> memory increment size and having to align the initial memory size to it.
>>
>> This is a preparation for memory device support. We'll unlock the
>> implementation with a new QEMU machine that supports memory devices.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I have not looked into this, so this is purely a question. 
> 
> Is there a way to hotplug virtio-mem memory beyond the initial size of
> the memory (as specified by the initial SCLP)? Then we could avoid doing
> this platform-specific diag260?

We need a way to tell the guest about the maximum possible PFN, so it
can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT
tables. On s390x, the only way I see is using a combination of diag260,
without introducing any other new mechanisms.

Currently Linux selects 3- vs. 4-level page tables based on that size (I
think that's what you were referring to with the 4TB limit). I can see
that kasan also does some magic based on the value ("populate kasan
shadow for untracked memory"), but did not look into the details. I
*think* kasan will never be able to track that memory, but am not
completely sure.
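
(For reference, the decision in question looks roughly like this in
arch/s390/kernel/setup.c at the time; a paraphrased fragment, not verbatim:)

	/* Choose kernel address space layout: 3 or 4 levels. */
	tmp = (memory_end ?: max_physmem_end) / PAGE_SIZE;
	tmp = tmp * (sizeof(struct page) + PAGE_SIZE);
	if (tmp + vmalloc_size + MODULES_LEN <= _REGION2_SIZE)
		vmax = _REGION2_SIZE; /* 3-level kernel page table */
	else
		vmax = _REGION1_SIZE; /* 4-level kernel page table */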

I'd like to avoid something like you propose (that's why I searched and
discovered diag260 after all :) ), especially to not silently break in
the future, when other assumptions based on that value are introduced.

E.g., on my z/VM LinuxONE Community Cloud machine, diag260 gets used by
default, so it does not seem to be a corner-case mechanism nowadays.

> The only issue I see is when we need to go beyond 4TB due to the page table
> upgrade in the kernel. 
> 
> FWIW diag 260 is publicly documented. 

Yeah, Conny pointed me at the doc - makes things easier :)
David Hildenbrand July 10, 2020, 8:32 a.m. UTC | #5
On 09.07.20 12:37, Cornelia Huck wrote:
> On Wed,  8 Jul 2020 20:51:32 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's implement the "storage configuration" part of diag260. This diag
>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>> As I don't have access to documentation, I have no clue what the actual
>> error cases are, and which other stuff we could eventually query using this
>> interface. Somebody with access to documentation should fix this. This
>> implementation seems to work with Linux guests just fine.
>>
>> The Linux kernel supports diag260 to query the available memory since
>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>> implemented. They will fail to detect the actual initial memory size.
>>
>> This interface allows us to expose the maximum ramsize via sclp
>> and the initial ramsize via diag260 - without having to mess with the
>> memory increment size and having to align the initial memory size to it.
>>
>> This is a preparation for memory device support. We'll unlock the
>> implementation with a new QEMU machine that supports memory devices.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>>  target/s390x/internal.h    |  2 ++
>>  target/s390x/kvm.c         | 11 ++++++++
>>  target/s390x/misc_helper.c |  6 ++++
>>  target/s390x/translate.c   |  4 +++
>>  5 files changed, 80 insertions(+)
>>
>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>> index 1a48429564..c3b1e24b2c 100644
>> --- a/target/s390x/diag.c
>> +++ b/target/s390x/diag.c
>> @@ -23,6 +23,63 @@
>>  #include "hw/s390x/pv.h"
>>  #include "kvm_s390x.h"
>>  
>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>> +{
>> +    MachineState *ms = MACHINE(qdev_get_machine());
>> +    const ram_addr_t initial_ram_size = ms->ram_size;
>> +    const uint64_t subcode = env->regs[r3];
>> +    S390CPU *cpu = env_archcpu(env);
>> +    ram_addr_t addr, length;
>> +    uint64_t tmp;
>> +
>> +    /* TODO: Unlock with new QEMU machine. */
>> +    if (false) {
>> +        s390_program_interrupt(env, PGM_OPERATION, ra);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * There also seems to be subcode "0xc", which stores the size of the
>> +     * first chunk and the total size to r1/r2. It's only used by very old
>> +     * Linux, so don't implement it.
> 
> FWIW,
> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
> seems to list the available subcodes. Anything but 0xc and 0x10 is for
> 24/31 bit only, so we can safely ignore them. Not sure what we want to
> do with 0xc: it is supposed to "Return the highest addressable byte of
> virtual storage in the host-primary address space, including named
> saved systems and saved segments", so returning the end of the address
> space should be easy enough, but not very useful.
> 
>> +     */
>> +    if ((r1 & 1) || subcode != 0x10) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +    addr = env->regs[r1];
>> +    length = env->regs[r1 + 1];
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> Doc mentioned above says for specification exception:
> 
> "For subcode X'10':
> • Rx is not an even-numbered register.
> • The address contained in Rx is not on a quadword boundary.
> • The length contained in Rx+1 is not a positive multiple of 16."
> 
>> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
>> +    if (!length) {
> 
> Probably specification exception as well?
> 
>> +        setcc(cpu, 3);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> For access exception:
> 
> "For subcode X'10', an error occurred trying to store the extent
> information into the guest's output area."
> 
>> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
>> +                                    MEMTXATTRS_UNSPECIFIED)) {
>> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
>> +        return;
>> +    }
>> +
>> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
>> +    tmp = cpu_to_be64(0);
>> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
>> +    tmp = cpu_to_be64(initial_ram_size - 1);
>> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
>> +
>> +    /* Exactly one entry was stored. */
>> +    env->regs[r3] = 1;
>> +    setcc(cpu, 0);
>> +}
>> +
>>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>>  {
>>      uint64_t func = env->regs[r1];
> 
> (...)
> 
>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
>> index 58dbc023eb..d7274eb320 100644
>> --- a/target/s390x/misc_helper.c
>> +++ b/target/s390x/misc_helper.c
>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>      uint64_t r;
>>  
>>      switch (num) {
>> +    case 0x260:
>> +        qemu_mutex_lock_iothread();
>> +        handle_diag_260(env, r1, r3, GETPC());
>> +        qemu_mutex_unlock_iothread();
>> +        r = 0;
>> +        break;
>>      case 0x500:
>>          /* KVM hypercall */
>>          qemu_mutex_lock_iothread();
> 
> Looking at the doc referenced above, it seems that we treat every diag
> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> to your patch; maybe I'm misreading.)

That's also a BUG in kvm then?

int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
{
...
	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
...
}
David Hildenbrand July 10, 2020, 8:41 a.m. UTC | #6
On 10.07.20 10:32, David Hildenbrand wrote:
> On 09.07.20 12:37, Cornelia Huck wrote:
>> On Wed,  8 Jul 2020 20:51:32 +0200
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> Let's implement the "storage configuration" part of diag260. This diag
>>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>>> As I don't have access to documentation, I have no clue what the actual
>>> error cases are, and which other stuff we could eventually query using this
>>> interface. Somebody with access to documentation should fix this. This
>>> implementation seems to work with Linux guests just fine.
>>>
>>> The Linux kernel supports diag260 to query the available memory since
>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>>> implemented. They will fail to detect the actual initial memory size.
>>>
>>> This interface allows us to expose the maximum ramsize via sclp
>>> and the initial ramsize via diag260 - without having to mess with the
>>> memory increment size and having to align the initial memory size to it.
>>>
>>> This is a preparation for memory device support. We'll unlock the
>>> implementation with a new QEMU machine that supports memory devices.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>>>  target/s390x/internal.h    |  2 ++
>>>  target/s390x/kvm.c         | 11 ++++++++
>>>  target/s390x/misc_helper.c |  6 ++++
>>>  target/s390x/translate.c   |  4 +++
>>>  5 files changed, 80 insertions(+)
>>>
>>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>>> index 1a48429564..c3b1e24b2c 100644
>>> --- a/target/s390x/diag.c
>>> +++ b/target/s390x/diag.c
>>> @@ -23,6 +23,63 @@
>>>  #include "hw/s390x/pv.h"
>>>  #include "kvm_s390x.h"
>>>  
>>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>>> +{
>>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>> +    const ram_addr_t initial_ram_size = ms->ram_size;
>>> +    const uint64_t subcode = env->regs[r3];
>>> +    S390CPU *cpu = env_archcpu(env);
>>> +    ram_addr_t addr, length;
>>> +    uint64_t tmp;
>>> +
>>> +    /* TODO: Unlock with new QEMU machine. */
>>> +    if (false) {
>>> +        s390_program_interrupt(env, PGM_OPERATION, ra);
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * There also seems to be subcode "0xc", which stores the size of the
>>> +     * first chunk and the total size to r1/r2. It's only used by very old
>>> +     * Linux, so don't implement it.
>>
>> FWIW,
>> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
>> seems to list the available subcodes. Anything but 0xc and 0x10 is for
>> 24/31 bit only, so we can safely ignore them. Not sure what we want to
>> do with 0xc: it is supposed to "Return the highest addressable byte of
>> virtual storage in the host-primary address space, including named
>> saved systems and saved segments", so returning the end of the address
>> space should be easy enough, but not very useful.
>>
>>> +     */
>>> +    if ((r1 & 1) || subcode != 0x10) {
>>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>>> +        return;
>>> +    }
>>> +    addr = env->regs[r1];
>>> +    length = env->regs[r1 + 1];
>>> +
>>> +    /* FIXME: Somebody with documentation should fix this. */
>>
>> Doc mentioned above says for specification exception:
>>
>> "For subcode X'10':
>> • Rx is not an even-numbered register.
>> • The address contained in Rx is not on a quadword boundary.
>> • The length contained in Rx+1 is not a positive multiple of 16."
>>
>>> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
>>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>>> +        return;
>>> +    }
>>> +
>>> +    /* FIXME: Somebody with documentation should fix this. */
>>> +    if (!length) {
>>
>> Probably specification exception as well?
>>
>>> +        setcc(cpu, 3);
>>> +        return;
>>> +    }
>>> +
>>> +    /* FIXME: Somebody with documentation should fix this. */
>>
>> For access exception:
>>
>> "For subcode X'10', an error occurred trying to store the extent
>> information into the guest's output area."
>>
>>> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
>>> +                                    MEMTXATTRS_UNSPECIFIED)) {
>>> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
>>> +        return;
>>> +    }
>>> +
>>> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
>>> +    tmp = cpu_to_be64(0);
>>> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
>>> +    tmp = cpu_to_be64(initial_ram_size - 1);
>>> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
>>> +
>>> +    /* Exactly one entry was stored. */
>>> +    env->regs[r3] = 1;
>>> +    setcc(cpu, 0);
>>> +}
>>> +
>>>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>>>  {
>>>      uint64_t func = env->regs[r1];
>>
>> (...)
>>
>>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
>>> index 58dbc023eb..d7274eb320 100644
>>> --- a/target/s390x/misc_helper.c
>>> +++ b/target/s390x/misc_helper.c
>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>>      uint64_t r;
>>>  
>>>      switch (num) {
>>> +    case 0x260:
>>> +        qemu_mutex_lock_iothread();
>>> +        handle_diag_260(env, r1, r3, GETPC());
>>> +        qemu_mutex_unlock_iothread();
>>> +        r = 0;
>>> +        break;
>>>      case 0x500:
>>>          /* KVM hypercall */
>>>          qemu_mutex_lock_iothread();
>>
>> Looking at the doc referenced above, it seems that we treat every diag
>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
>> to your patch; maybe I'm misreading.)
> 
> That's also a BUG in kvm then?
> 
> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> {
> ...
> 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> ...
> }
> 

But OTOH, it does not sound sane if user space can bypass the OS to
yield the CPU ... so this might just be wrong documentation. All DIAGs
should be privileged IIRC.
David Hildenbrand July 10, 2020, 9:17 a.m. UTC | #7
On 09.07.20 20:15, David Hildenbrand wrote:
> On 09.07.20 12:52, Christian Borntraeger wrote:
>>
>> On 08.07.20 20:51, David Hildenbrand wrote:
>>> Let's implement the "storage configuration" part of diag260. This diag
>>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>>> As I don't have access to documentation, I have no clue what the actual
>>> error cases are, and which other stuff we could eventually query using this
>>> interface. Somebody with access to documentation should fix this. This
>>> implementation seems to work with Linux guests just fine.
>>>
>>> The Linux kernel supports diag260 to query the available memory since
>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>>> implemented. They will fail to detect the actual initial memory size.
>>>
>>> This interface allows us to expose the maximum ramsize via sclp
>>> and the initial ramsize via diag260 - without having to mess with the
>>> memory increment size and having to align the initial memory size to it.
>>>
>>> This is a preparation for memory device support. We'll unlock the
>>> implementation with a new QEMU machine that supports memory devices.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> I have not looked into this, so this is purely a question. 
>>
>> Is there a way to hotplug virtio-mem memory beyond the initial size of 
>> the memory (as specified by the initial SCLP)? Then we could avoid doing
>> this platform-specific diag260?
> 
> We need a way to tell the guest about the maximum possible PFN, so it
> can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT
> tables. On s390x, the only way I see is using a combination of diag260,
> without introducing any other new mechanisms.
> 
> Currently Linux selects 3- vs. 4-level page tables based on that size (I
> think that's what you were referring to with the 4TB limit). I can see
> that kasan also does some magic based on the value ("populate kasan
> shadow for untracked memory"), but did not look into the details. I
> *think* kasan will never be able to track that memory, but am not
> completely sure.
> 
> I'd like to avoid something like you propose (that's why I searched and
> discovered diag260 after all :) ), especially to not silently break in
> the future, when other assumptions based on that value are introduced.
> 
> E.g., on my z/VM LinuxONE Community Cloud machine, diag260 gets used by
> default, so it does not seem to be a corner-case mechanism nowadays.
> 

Note: Reading about diag260 subcode 0xc, we could modify Linux to query
the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
indicating maxram size via SCLP, and keep diag260-unaware OSs
working as before. Thoughts?
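
(A sketch of what such a kernel-side 0xc query could look like; the register
semantics are an assumption based on the doc snippet above, and real code
needs program-check handling for hypervisors without diag260:)

	static unsigned long diag260_0xc(void)
	{
		register unsigned long rx asm("2") = 0;
		register unsigned long ry asm("4") = 0xc;

		asm volatile("diag %[rx],%[ry],0x260"
			     : [rx] "+d" (rx), [ry] "+d" (ry) : : "cc");
		return rx; /* assumed: highest addressable byte */
	}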
Cornelia Huck July 10, 2020, 9:19 a.m. UTC | #8
On Fri, 10 Jul 2020 10:41:33 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 10.07.20 10:32, David Hildenbrand wrote:
> > On 09.07.20 12:37, Cornelia Huck wrote:  
> >> On Wed,  8 Jul 2020 20:51:32 +0200
> >> David Hildenbrand <david@redhat.com> wrote:

> >>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
> >>> index 58dbc023eb..d7274eb320 100644
> >>> --- a/target/s390x/misc_helper.c
> >>> +++ b/target/s390x/misc_helper.c
> >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
> >>>      uint64_t r;
> >>>  
> >>>      switch (num) {
> >>> +    case 0x260:
> >>> +        qemu_mutex_lock_iothread();
> >>> +        handle_diag_260(env, r1, r3, GETPC());
> >>> +        qemu_mutex_unlock_iothread();
> >>> +        r = 0;
> >>> +        break;
> >>>      case 0x500:
> >>>          /* KVM hypercall */
> >>>          qemu_mutex_lock_iothread();  
> >>
> >> Looking at the doc referenced above, it seems that we treat every diag
> >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> >> to your patch; maybe I'm misreading.)  
> > 
> > That's also a BUG in kvm then?
> > 
> > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> > {
> > ...
> > 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> > 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> > ...
> > }
> >   
> 
> But OTOH, it does not sound sane if user space can bypass the OS to
> yield the CPU ... so this might just be wrong documentation. All DIAGs
> should be privileged IIRC.

Maybe not all of them, but the diag 0x44 case is indeed odd. No idea
what is documented for its use on LPAR (I don't think that document is
public.)
David Hildenbrand July 10, 2020, 12:12 p.m. UTC | #9
On 10.07.20 11:17, David Hildenbrand wrote:
> On 09.07.20 20:15, David Hildenbrand wrote:
>> On 09.07.20 12:52, Christian Borntraeger wrote:
>>>
>>> On 08.07.20 20:51, David Hildenbrand wrote:
>>>> Let's implement the "storage configuration" part of diag260. This diag
>>>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>>>> As I don't have access to documentation, I have no clue what the actual
>>>> error cases are, and which other stuff we could eventually query using this
>>>> interface. Somebody with access to documentation should fix this. This
>>>> implementation seems to work with Linux guests just fine.
>>>>
>>>> The Linux kernel supports diag260 to query the available memory since
>>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>>>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>>>> implemented. They will fail to detect the actual initial memory size.
>>>>
>>>> This interface allows us to expose the maximum ramsize via sclp
>>>> and the initial ramsize via diag260 - without having to mess with the
>>>> memory increment size and having to align the initial memory size to it.
>>>>
>>>> This is a preparation for memory device support. We'll unlock the
>>>> implementation with a new QEMU machine that supports memory devices.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> I have not looked into this, so this is purely a question. 
>>>
>>> Is there a way to hotplug virtio-mem memory beyond the initial size of 
>>> the memory (as specified by the initial SCLP)? Then we could avoid doing
>>> this platform-specific diag260?
>>
>> We need a way to tell the guest about the maximum possible PFN, so it
>> can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT
>> tables. On s390x, the only way I see is using a combination of diag260,
>> without introducing any other new mechanisms.
>>
>> Currently Linux selects 3- vs. 4-level page tables based on that size (I
>> think that's what you were referring to with the 4TB limit). I can see
>> that kasan also does some magic based on the value ("populate kasan
>> shadow for untracked memory"), but did not look into the details. I
>> *think* kasan will never be able to track that memory, but am not
>> completely sure.
>>
>> I'd like to avoid something like you propose (that's why I searched and
>> discovered diag260 after all :) ), especially to not silently break in
>> the future, when other assumptions based on that value are introduced.
>>
>> E.g., on my z/VM LinuxONE Community Cloud machine, diag260 gets used by
>> default, so it does not seem to be a corner-case mechanism nowadays.
>>
> 
> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> indicating maxram size via SCLP, and keep diag260-unaware OSs
> working as before. Thoughts?

Implemented it, seems to work fine.
Heiko Carstens July 10, 2020, 3:18 p.m. UTC | #10
On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
> > Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> > the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> > indicating maxram size via SCLP, and keep diag260-unaware OSs
> > working as before. Thoughts?
> 
> Implemented it, seems to work fine.

The returned value would not include standby/reserved memory within
z/VM. So this seems not to work.
Also: why do you want to change this?
David Hildenbrand July 10, 2020, 3:24 p.m. UTC | #11
On 10.07.20 17:18, Heiko Carstens wrote:
> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>> working as before. Thoughts?
>>
>> Implemented it, seems to work fine.
> 
> The returned value would not include standby/reserved memory within
> z/VM. So this seems not to work.

Which value exactly are you referencing? diag 0xc returns two values.
One of them seems to do exactly what we need.

See
https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7

for my current Linux approach.

> Also: why do you want to change this

Which change exactly do you mean?

If we limit the value returned via SCLP to initial memory, we cannot
break any guest (e.g., Linux pre v4.20, kvm-unit-tests). diag260 is then
purely optional.
Heiko Carstens July 10, 2020, 3:43 p.m. UTC | #12
On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
> On 10.07.20 17:18, Heiko Carstens wrote:
> > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
> >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> >>> indicating maxram size via SCLP, and keep diag260-unaware OSs
> >>> working as before. Thoughts?
> >>
> >> Implemented it, seems to work fine.
> > 
> > The returned value would not include standby/reserved memory within
> > z/VM. So this seems not to work.
> 
> Which value exactly are you referencing? diag 0xc returns two values.
> One of them seems to do exactly what we need.

Maybe I'm missing something as usual, but to me this
--------
Usage Notes:
...
2. If the RESERVED or STANDBY option was used on the DEFINE STORAGE
command to configure reserved or standby storage for a guest, the
values returned in Rx and Ry will be the current values, but these
values can change dynamically depending on the options specified and
any dynamic storage reconfiguration (DSR) changes initiated by the
guest.
--------
reads like it is not doing what you want. That is: it does *not*
include standby memory and therefore will not return the highest
possible pfn.
David Hildenbrand July 10, 2020, 3:45 p.m. UTC | #13
> On 10.07.2020 17:43, Heiko Carstens <hca@linux.ibm.com> wrote:
> 
> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>>>> working as before. Thoughts?
>>>> 
>>>> Implemented it, seems to work fine.
>>> 
>>> The returned value would not include standby/reserved memory within
>>> z/VM. So this seems not to work.
>> 
>> Which value exactly are you referencing? diag 0xc returns two values.
>> One of them seems to do exactly what we need.
> 
> Maybe I'm missing something as usual, but to me this
> --------
> Usage Notes:
> ...
> 2. If the RESERVED or STANDBY option was used on the DEFINE STORAGE
> command to configure reserved or standby storage for a guest, the
> values returned in Rx and Ry will be the current values, but these
> values can change dynamically depending on the options specified and
> any dynamic storage reconfiguration (DSR) changes initiated by the
> guest.
> --------
> reads like it is not doing what you want. That is: it does *not*
> include standby memory and therefore will not return the highest
> possible pfn.
> 

Ah, yes. See the kernel patch: I take the max of both values (SCLP, diag260(0xc)).

Anyhow, what would be your recommendation?

Thanks!
Heiko Carstens July 13, 2020, 9:12 a.m. UTC | #14
On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
> On 10.07.20 17:18, Heiko Carstens wrote:
> > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
> >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> >>> indicating maxram size via SCLP, and keep diag260-unaware OSs
> >>> working as before. Thoughts?
> >>
> >> Implemented it, seems to work fine.
> > 
> > The returned value would not include standby/reserved memory within
> > z/VM. So this seems not to work.
> 
> Which value exactly are you referencing? diag 0xc returns two values.
> One of them seems to do exactly what we need.
> 
> See
> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
> 
> for my current Linux approach.
> 
> > Also: why do you want to change this
> 
> Which change exactly do you mean?
> 
> If we limit the value returned via SCLP to initial memory, we cannot
> break any guest (e.g., Linux pre v4.20, kvm-unit-tests). diag260 is then
> purely optional.

Ok, now I see the context. Christian added me just to Cc on this
specific patch.
So if I understand you correctly, then you want to use diag 260 in
order to figure out how much memory is _potentially_ available for a
guest?

This does not fit the current semantics, since diag 260 returns the
highest *currently* accessible address. That is: it
does explicitly *not* include standby memory or anything else that
might potentially be there.

So you would need a different interface to tell the guest about your
new hotplug memory interface. If sclp does not work, then maybe a new
diagnose(?).
David Hildenbrand July 13, 2020, 10:27 a.m. UTC | #15
> On 13.07.2020 11:12, Heiko Carstens <hca@linux.ibm.com> wrote:
> 
> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>>>> working as before. Thoughts?
>>>> 
>>>> Implemented it, seems to work fine.
>>> 
>>> The returned value would not include standby/reserved memory within
>>> z/VM. So this seems not to work.
>> 
>> Which value exactly are you referencing? diag 0xc returns two values.
>> One of them seems to do exactly what we need.
>> 
>> See
>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
>> 
>> for my current Linux approach.
>> 
>>> Also: why do you want to change this
>> 
>> Which change exactly do you mean?
>> 
>> If we limit the value returned via SCLP to initial memory, we cannot
>> break any guest (e.g., Linux pre v4.20, kvm-unit-tests). diag260 is then
>> purely optional.
> 
> Ok, now I see the context. Christian added me just to Cc on this
> specific patch.

I tried to Cc you on all patches but the mail bounced with unknown address (maybe I messed up).

> So if I understand you correctly, then you want to use diag 260 in
> order to figure out how much memory is _potentially_ available for a
> guest?

Yes, exactly.

> 
> This does not fit the current semantics, since diag 260 returns the
> highest *currently* accessible address. That is: it
> does explicitly *not* include standby memory or anything else that
> might potentially be there.

The confusing part is that it talks about „addressable“ and not „accessible“. Now that I understand the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory.

I agree that reusing that interface might not be what we want. It just seemed too easy to avoid creating something new :)

> 
> So you would need a different interface to tell the guest about your
> new hotplug memory interface. If sclp does not work, then maybe a new
> diagnose(?).
> 

Yes, I think a new Diagnose makes sense. I‘ll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?
Christian Borntraeger July 13, 2020, 11:08 a.m. UTC | #16
On 13.07.20 12:27, David Hildenbrand wrote:
> 
> 
>> On 13.07.2020 11:12, Heiko Carstens <hca@linux.ibm.com> wrote:
>>
>> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>>>>> working as before. Thoughts?
>>>>>
>>>>> Implemented it, seems to work fine.
>>>>
>>>> The returned value would not include standby/reserved memory within
>>>> z/VM. So this seems not to work.
>>>
>>> Which value exactly are you referencing? diag 0xc returns two values.
>>> One of them seems to do exactly what we need.
>>>
>>> See
>>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
>>>
>>> for my current Linux approach.
>>>
>>>> Also: why do you want to change this
>>>
>>> Which change exactly do you mean?
>>>
>>> If we limit the value returned via SCLP to initial memory, we cannot
>>> break any guest (e.g., Linux pre v4.20, kvm-unit-tests). diag260 is then
>>> purely optional.
>>
>> Ok, now I see the context. Christian added me just to Cc on this
>> specific patch.
> 
> I tried to Cc you on all patches but the mail bounced with unknown address (maybe I messed up).
> 
>> So if I understand you correctly, then you want to use diag 260 in
>> order to figure out how much memory is _potentially_ available for a
>> guest?
> 
> Yes, exactly.
> 
>>
>> This does not fit the current semantics, since diag 260 returns the
>> highest *currently* accessible address. That is: it
>> does explicitly *not* include standby memory or anything else that
>> might potentially be there.
> 
> The confusing part is that it talks about „addressable“ and not „accessible“. Now that I understand the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory.
> 
> I agree that reusing that interface might not be what we want. It just seemed too easy to avoid creating something new :)
> 
>>
>> So you would need a different interface to tell the guest about your
>> new hotplug memory interface. If sclp does not work, then maybe a new
>> diagnose(?).
>>
> 
> Yes, I think a new Diagnose makes sense. I‘ll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?

Wouldn't SCLP be the right thing to provide the max increment number? (and thus the max memory address)
And then (if I got the discussion right) use diag 260 to get the _current_ value.
Christian Borntraeger July 13, 2020, 11:54 a.m. UTC | #17
On 10.07.20 10:32, David Hildenbrand wrote:

>>> --- a/target/s390x/misc_helper.c
>>> +++ b/target/s390x/misc_helper.c
>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>>      uint64_t r;
>>>  
>>>      switch (num) {
>>> +    case 0x260:
>>> +        qemu_mutex_lock_iothread();
>>> +        handle_diag_260(env, r1, r3, GETPC());
>>> +        qemu_mutex_unlock_iothread();
>>> +        r = 0;
>>> +        break;
>>>      case 0x500:
>>>          /* KVM hypercall */
>>>          qemu_mutex_lock_iothread();
>>
>> Looking at the doc referenced above, it seems that we treat every diag
>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
>> to your patch; maybe I'm misreading.)
> 
> That's also a BUG in kvm then?
> 
> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> {
> ...
> 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> ...
> }

diag 44 gives a PRIVOP on LPAR, so I think this is fine.
Cornelia Huck July 13, 2020, 12:11 p.m. UTC | #18
On Mon, 13 Jul 2020 13:54:41 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 10.07.20 10:32, David Hildenbrand wrote:
> 
> >>> --- a/target/s390x/misc_helper.c
> >>> +++ b/target/s390x/misc_helper.c
> >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
> >>>      uint64_t r;
> >>>  
> >>>      switch (num) {
> >>> +    case 0x260:
> >>> +        qemu_mutex_lock_iothread();
> >>> +        handle_diag_260(env, r1, r3, GETPC());
> >>> +        qemu_mutex_unlock_iothread();
> >>> +        r = 0;
> >>> +        break;
> >>>      case 0x500:
> >>>          /* KVM hypercall */
> >>>          qemu_mutex_lock_iothread();  
> >>
> >> Looking at the doc referenced above, it seems that we treat every diag
> >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> >> to your patch; maybe I'm misreading.)  
> > 
> > That's also a BUG in kvm then?
> > 
> > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> > {
> > ...
> > 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> > 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> > ...
> > }  
> 
> diag 44 gives a PRIVOP on LPAR, so I think this is fine. 
> 

Seems like a bug/inconsistency in CP (or its documentation), then.
Christian Borntraeger July 13, 2020, 12:13 p.m. UTC | #19
On 13.07.20 14:11, Cornelia Huck wrote:
> On Mon, 13 Jul 2020 13:54:41 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 10.07.20 10:32, David Hildenbrand wrote:
>>
>>>>> --- a/target/s390x/misc_helper.c
>>>>> +++ b/target/s390x/misc_helper.c
>>>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>>>>      uint64_t r;
>>>>>  
>>>>>      switch (num) {
>>>>> +    case 0x260:
>>>>> +        qemu_mutex_lock_iothread();
>>>>> +        handle_diag_260(env, r1, r3, GETPC());
>>>>> +        qemu_mutex_unlock_iothread();
>>>>> +        r = 0;
>>>>> +        break;
>>>>>      case 0x500:
>>>>>          /* KVM hypercall */
>>>>>          qemu_mutex_lock_iothread();  
>>>>
>>>> Looking at the doc referenced above, it seems that we treat every diag
>>>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
>>>> to your patch; maybe I'm misreading.)  
>>>
>>> That's also a BUG in kvm then?
>>>
>>> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
>>> {
>>> ...
>>> 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
>>> 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
>>> ...
>>> }  
>>
>> diag 44 gives a PRIVOP on LPAR, so I think this is fine. 
>>
> 
> Seems like a bug/inconsistency in CP (or its documentation), then.

Yes. 

.globl main
main:
        diag 0,0,0x44
        svc 1



also crashes under z/VM with an illegal op.
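
The same probe as a C program, for reference (a sketch; "diag 0,0,0x44" is
the voluntary time-slice-end hint the kernel itself uses, and the resulting
privileged-operation program interrupt is delivered to the process as SIGILL):

int main(void)
{
	/* Problem-state DIAG 0x44: expected to be rejected as privileged. */
	asm volatile("diag 0,0,0x44" : : : "memory");
	return 0;
}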
David Hildenbrand July 15, 2020, 9:42 a.m. UTC | #20
On 13.07.20 13:08, Christian Borntraeger wrote:
> On 13.07.20 12:27, David Hildenbrand wrote:
>>
>>
>>> On 13.07.2020 11:12, Heiko Carstens <hca@linux.ibm.com> wrote:
>>>
>>> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>>>>>> working as before. Thoughts?
>>>>>>
>>>>>> Implemented it, seems to work fine.
>>>>>
>>>>> The returned value would not include standby/reserved memory within
>>>>> z/VM. So this seems not to work.
>>>>
>>>> Which value exactly are you referencing? diag 0xc returns two values.
>>>> One of them seems to do exactly what we need.
>>>>
>>>> See
>>>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
>>>>
>>>> for my current Linux approach.
>>>>
>>>>> Also: why do you want to change this
>>>>
>>>> Which change exactly do you mean?
>>>>
>>>> If we limit the value returned via SCLP to initial memory, we cannot
>>>> break any guest (e.g., Linux pre v4.20, kvm-unit-tests). diag260 is then
>>>> purely optional.
>>>
>>> Ok, now I see the context. Christian added me just to Cc on this
>>> specific patch.
>>
>> I tried to Cc you on all patches but the mail bounced with unknown address (maybe I messed up).
>>
>>> So if I understand you correctly, then you want to use diag 260 in
>>> order to figure out how much memory is _potentially_ available for a
>>> guest?
>>
>> Yes, exactly.
>>
>>>
>>> This does not fit the current semantics, since diag 260 returns the
>>> highest *currently* accessible address. That is: it
>>> does explicitly *not* include standby memory or anything else that
>>> might potentially be there.
>>
>> The confusing part is that it talks about „addressable“ and not „accessible“. Now that I understand the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory.
>>
>> I agree that reusing that interface might not be what we want. It just seemed too easy to avoid creating something new :)
>>
>>>
>>> So you would need a different interface to tell the guest about your
>>> new hotplug memory interface. If sclp does not work, then maybe a new
>>> diagnose(?).
>>>
>>
>> Yes, I think a new Diagnose makes sense. I'll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?
> 
> Wouldn't sclp be the right thing to provide the max increment number? (and thus the max memory address)
> And then (when I got the discussion right) use diag 260 to get the _current_ value.

So, in summary, we want to indicate to the guest a memory region that
will be used to place memory devices ("device memory region"). The
region might have holes and the memory within this region might have
different semantics than ordinary system memory. Memory that belongs to
memory devices should only be detected+used if the guest OS has support
for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest
(e.g., no virtio-mem driver) should not accidentally make use of such
memory.

We need a way to
a) Tell the guest about boot memory (currently ram_size)
b) Tell the guest about the maximum possible ram address, including
device memory. (We could also indicate the special "device memory
region" explicitly)


AFAIK, we have three options:


1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10)

This is what this series (RFCv1) does.

Advantages:
- No need for a new diag. No need for memory sensing kernel changes.
Disadvantages
- Older guests without support for diag260 (<v4.20, kvm-unit-tests) will
  assume all memory is accessible. Bad.
- The semantics of the value returned in ry via diag260(0xc) is somewhat
  unclear. Should we return the end address of the highest memory
  device? OTOH, an unmodified guest OS (without support for memory
  devices) should not have to care at all about any such memory.
- If we ever want to also support standby memory, we might be in
  trouble. (see below)

2. Indicate ram_size via SCLP, indicate device memory region
   (currently maxram_size) via new DIAG

Advantages:
- Unmodified guests won't use/sense memory belonging to memory devices.
- We can later have standby memory + memory devices co-exist.
Disadvantages
- Need a new DIAG.

3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby
   memory)

I did not look into the details, because -ENODOCUMENTATION. At least we
would run into some alignment issues (again, having to align
ram_size/maxram_size to storage increments - which would no longer be
1MB). We would run into issues later, trying to also support standby memory.



I guess 1) would mostly work; one just has to run a suitable guest
inside the VM. This is no different to running under z/VM, where querying
diag260 is required. The nice thing about 2) would be that we can
easily implement standby memory. Something like:

-m 2G,maxram_size=20G,standbyram_size=4G

[ 2G boot RAM ][ 4G standby RAM ][ 14G device memory ]
                                 ^ via SCLP maximum increment
                                                     ^ via new DIAG
Heiko Carstens July 15, 2020, 10:43 a.m. UTC | #21
On Wed, Jul 15, 2020 at 11:42:37AM +0200, David Hildenbrand wrote:
> So, in summary, we want to indicate to the guest a memory region that
> will be used to place memory devices ("device memory region"). The
> region might have holes and the memory within this region might have
> different semantics than ordinary system memory. Memory that belongs to
> memory devices should only be detected+used if the guest OS has support
> for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest
> (e.g., no virtio-mem driver) should not accidentally make use of such
> memory.
> 
> We need a way to
> a) Tell the guest about boot memory (currently ram_size)
> b) Tell the guest about the maximum possible ram address, including
> device memory. (We could also indicate the special "device memory
> region" explicitly)
> 
> AFAIK, we have three options:
> 
> 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10)
> 
> This is what this series (RFCv1) does.
> 
> Advantages:
> - No need for a new diag. No need for memory sensing kernel changes.
> Disadvantages
> - Older guests without support for diag260 (<v4.20, kvm-unit-tests) will
>   assume all memory is accessible. Bad.

Why would old guests assume that?

At least in v4.1 the kernel will calculate the max address by using
increment size * increment number and then test if *each* increment is
available with tprot.
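
(For illustration, a minimal C sketch of that sensing loop - not the
actual kernel code; EX_TABLE is the kernel's exception-table macro, and
a real implementation needs that fixup so a tprot on non-existent
memory returns an error instead of taking an unhandled exception:)

static int tprot(unsigned long addr)
{
        int rc = -1;    /* stays -1 if the probe faults (EX_TABLE fixup) */

        asm volatile(
                "       tprot   0(%1),0\n"
                "0:     ipm     %0\n"
                "       srl     %0,28\n"
                "1:\n"
                EX_TABLE(0b, 1b)
                : "+d" (rc) : "a" (addr) : "cc");
        return rc;
}

static unsigned long sense_memory_end(unsigned long rzm, unsigned long rnmax)
{
        unsigned long addr;

        /* max address = increment size (rzm) * number of increments (rnmax) */
        for (addr = 0; addr < rzm * rnmax; addr += rzm) {
                if (tprot(addr) < 0)    /* fault: no memory behind addr */
                        break;
        }
        return addr;    /* end of the contiguous usable region */
}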

> - The semantics of the value returned in ry via diag260(0xc) is somewhat
>   unclear. Should we return the end address of the highest memory
>   device? OTOH, an unmodified guest OS (without support for memory
>   devices) should not have to care at all about any such memory.

I'm confused. The kernel currently only uses diag260(0x10). How is
diag260(0xc) relevant here?

> 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby
>    memory)
> 
> I did not look into the details, because -ENODOCUMENTATION. At least we
> would run into some alignment issues (again, having to align
> ram_size/maxram_size to storage increments - which would no longer be
> 1MB). We would run into issues later, trying to also support standby memory.

That doesn't make sense to me: either support memory hotplug via
sclp/standby memory, or with your new method. But trying to support
both... what's the use case?
David Hildenbrand July 15, 2020, 11:21 a.m. UTC | #22
On 15.07.20 12:43, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 11:42:37AM +0200, David Hildenbrand wrote:
>> So, in summary, we want to indicate to the guest a memory region that
>> will be used to place memory devices ("device memory region"). The
>> region might have holes and the memory within this region might have
>> different semantics than ordinary system memory. Memory that belongs to
>> memory devices should only be detected+used if the guest OS has support
>> for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest
>> (e.g., no virtio-mem driver) should not accidentally make use of such
>> memory.
>>
>> We need a way to
>> a) Tell the guest about boot memory (currently ram_size)
>> b) Tell the guest about the maximum possible ram address, including
>> device memory. (We could also indicate the special "device memory
>> region" explicitly)
>>
>> AFAIK, we have three options:
>>
>> 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10)
>>
>> This is what this series (RFCv1) does.
>>
>> Advantages:
>> - No need for a new diag. No need for memory sensing kernel changes.
>> Disadvantages
>> - Older guests without support for diag260 (<v4.20, kvm-unit-tests) will
>>   assume all memory is accessible. Bad.
> 
> Why would old guests assume that?
> 
> At least in v4.1 the kernel will calculate the max address by using
> increment size * increment number and then test if *each* increment is
> available with tprot.

Yes, we do the same in kvm-unit-tests. But it's not sufficient for
memory devices.

Just because a tprot succeeds (for memory belonging to a memory device)
does not mean the kernel should silently start to use that memory.

Note: memory devices are not just DIMMs that can be mapped to storage
increments. The memory might have completely different semantics; that's
why they are glued to a managing virtio device.

For example: a tprot might succeed on a memory region provided by
virtio-mem; this does, however, not mean that the memory can (and
should) be used by the guest.

> 
>> - The semantics of the value returned in ry via diag260(0xc) is somewhat
>>   unclear. Should we return the end address of the highest memory
>>   device? OTOH, an unmodified guest OS (without support for memory
>>   devices) should not have to care at all about any such memory.
> 
> I'm confused. The kernel currently only uses diag260(0x10). How is
> diag260(0xc) relevant here?

We have to implement diag260(0x10) if we implement diag260(0xc), no? Or
can we simply throw a specification exception?
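
(For reference, a guest-side subcode 0x10 call matching the ABI this
patch implements could look roughly like the sketch below - the buffer
must be 16-byte aligned and its length a multiple of 16, rx/rx+1 carry
buffer address/length, ry carries the subcode on entry and the number
of stored extents on exit; register numbers and names are illustrative:)

struct mem_extent {
        unsigned long start;
        unsigned long end;      /* inclusive end address */
};

static int diag260_storage_config(struct mem_extent *buf, unsigned long len)
{
        register unsigned long rx  asm("2") = (unsigned long)buf;
        register unsigned long rx1 asm("3") = len;
        register unsigned long ry  asm("4") = 0x10;     /* subcode */
        int cc;

        asm volatile(
                "       diag    %[rx],%[ry],0x260\n"
                "       ipm     %[cc]\n"
                "       srl     %[cc],28\n"
                : [cc] "=d" (cc), [ry] "+d" (ry)
                : [rx] "d" (rx), "d" (rx1)
                : "cc", "memory");
        return cc ? -1 : (int)ry;       /* number of extents on success */
}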

> 
>> 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby
>>    memory)
>>
>> I did not look into the details, because -ENODOCUMENTATION. At least we
>> would run into some alignment issues (again, having to align
>> ram_size/maxram_size to storage increments - which would no longer be
>> 1MB). We would run into issues later, trying to also support standby memory.
> 
> That doesn't make sense to me: either support memory hotplug via
> sclp/standby memory, or with your new method. But trying to support
> both... what's the use case?

Not sure if there is any; it just feels cleaner to me to separate the
architected (sclp memory/reserved/standby) bits, which specify
semantics when used via rnmax+tprot, from QEMU-specific memory ranges
that have special semantics.

virtio-mem is only one type of virtio-based memory device. In the
future we might want to have virtio-pmem, but there might be more ...
Heiko Carstens July 15, 2020, 11:34 a.m. UTC | #23
On Wed, Jul 15, 2020 at 01:21:06PM +0200, David Hildenbrand wrote:
> > At least in v4.1 the kernel will calculate the max address by using
> > increment size * increment number and then test if *each* increment is
> > available with tprot.
> 
> Yes, we do the same in kvm-unit-tests. But it's not sufficient for
> memory devices.
> 
> Just because a tprot succeeds (for memory belonging to a memory device)
> does not mean the kernel should silently start to use that memory.
> 
> Note: memory devices are not just DIMMs that can be mapped to storage
> increments. The memory might have completely different semantics; that's
> why they are glued to a managing virtio device.
> 
> For example: a tprot might succeed on a memory region provided by
> virtio-mem; this does, however, not mean that the memory can (and
> should) be used by the guest.

So, are you saying that even at IPL time there might already be memory
devices attached to the system? And the kernel should _not_ treat them
as normal memory?
David Hildenbrand July 15, 2020, 11:42 a.m. UTC | #24
On 15.07.20 13:34, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 01:21:06PM +0200, David Hildenbrand wrote:
>>> At least in v4.1 the kernel will calculate the max address by using
>>> increment size * increment number and then test if *each* increment is
>>> available with tprot.
>>
>> Yes, we do the same in kvm-unit-tests. But it's not sufficient for
>> memory devices.
>>
>> Just because a tprot succeeds (for memory belonging to a memory device)
>> does not mean the kernel should silently start to use that memory.
>>
>> Note: memory devices are not just DIMMs that can be mapped to storage
>> increments. The memory might have completely different semantics; that's
>> why they are glued to a managing virtio device.
>>
>> For example: a tprot might succeed on a memory region provided by
>> virtio-mem; this does, however, not mean that the memory can (and
>> should) be used by the guest.
> 
> So, are you saying that even at IPL time there might already be memory
> devices attached to the system? And the kernel should _not_ treat them
> as normal memory?

Sorry if that was unclear. Yes, we can have such devices (including
memory areas) on a cold boot/reboot/kexec. In addition, they might pop
up at runtime (e.g., hotplugging a virtio-mem device). The device is in
charge of exposing that area and deciding what to do with it.

The kernel should never treat them as normal memory (IOW, system RAM).
Not during a cold boot, not during a reboot. The device driver is
responsible for deciding how to use that memory (e.g., add it as system
RAM), and which parts of that memory are actually valid to be used (even
if a tprot might succeed it might not be valid to use just yet - I guess
somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
want to use it like normal memory).

E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
exposed via the e820 map. The only trace that there might be *something*
now/in the future is indicated via ACPI SRAT tables. This currently
takes care of indicating the maximum possible PFN.
Heiko Carstens July 15, 2020, 4:14 p.m. UTC | #25
On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
> > So, are you saying that even at IPL time there might already be memory
> > devices attached to the system? And the kernel should _not_ treat them
> > as normal memory?
> 
> Sorry if that was unclear. Yes, we can have such devices (including
> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
> charge of exposing that area and deciding what to do with it.
> 
> The kernel should never treat them as normal memory (IOW, system RAM).
> Not during a cold boot, not during a reboot. The device driver is
> responsible for deciding how to use that memory (e.g., add it as system
> RAM), and which parts of that memory are actually valid to be used (even
> if a tprot might succeed it might not be valid to use just yet - I guess
> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
> want to use it like normal memory).
> 
> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
> exposed via the e820 map. The only trace that there might be *something*
> now/in the future is indicated via ACPI SRAT tables. This currently
> takes care of indicating the maximum possible PFN.

Ok, but all of this needs to be documented somewhere. This raises a
couple of questions to me:

What happens on

- IPL Clear with this special memory? Will it be detached/away afterwards?
- IPL Normal? "Obviously" it must stay otherwise kdump would never see
  that memory.

And when you write it's up to the device driver what to do with that
memory: is there any documentation available what all of this is good
for? I would assume _most likely_ this extra memory is going to be
added to ZONE_MOVABLE _somehow_ so that it can be taken away also. But
since it is not normal memory, like you say, I'm wondering how that is
supposed to work.

As far as I can tell there would be a lot of inconsistencies in
userspace interfaces which provide memory / zone information. Or I'm
not getting the point of all of this at all.

So please provide more information, or a pointer to documentation.
David Hildenbrand July 15, 2020, 5:38 p.m. UTC | #26
On 15.07.20 18:14, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
>>> So, are you saying that even at IPL time there might already be memory
>>> devices attached to the system? And the kernel should _not_ treat them
>>> as normal memory?
>>
>> Sorry if that was unclear. Yes, we can have such devices (including
>> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
>> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
>> charge of exposing that area and deciding what to do with it.
>>
>> The kernel should never treat them as normal memory (IOW, system RAM).
>> Not during a cold boot, not during a reboot. The device driver is
>> responsible for deciding how to use that memory (e.g., add it as system
>> RAM), and which parts of that memory are actually valid to be used (even
>> if a tprot might succeed it might not be valid to use just yet - I guess
>> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
>> want to use it like normal memory).
>>
>> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
>> exposed via the e820 map. The only trace that there might be *something*
>> now/in the future is indicated via ACPI SRAT tables. This currently
>> takes care of indicating the maximum possible PFN.
> 
> Ok, but all of this needs to be documented somewhere. This raises a
> couple of questions to me:

I assume this mostly targets virtio-mem, because the semantics of
virtio-mem provided memory are extra-weird (in contrast to rather static
virtio-pmem, which is essentially just an emulated NVDIMM - a disk
mapped into physical memory).

Regarding documentation (some linked in the cover letter), so far I have
(generic/x86-64)

1. https://virtio-mem.gitlab.io/
2. virtio spec proposal [1]
3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
5. Linux cover letter [2]
6. KVM forum talk [3] [4]

As your questions go quite into technical detail, and I don't feel like
rewriting the doc here :) , I suggest looking at [2], 1, and 5.

> 
> What happens on

I'll stick to virtio-mem when answering regarding "special memory". As I
noted, there might be more in the future.

> 
> - IPL Clear with this special memory? Will it be detached/away afterwards?

A diag308(0x3) - load clear - will usually* zap all virtio-mem provided
memory (discard backing storage in the hypervisor) and logically turn
the state of all virtio-mem memory inside the device-assigned memory
region to "unplugged" - just as during a cold boot. The semantics of
"unplugged" blocks depend on the "usable region" (see the virtio-spec if
you're curious - the memory might still be accessible). Starting "fresh"
with all memory logically unplugged is part of the way virtio-mem works.

* there are corner cases while a VM is getting migrated, where we cannot
perform this (similar to us not being able to clear ordinary memory
during a load clear in QEMU while migrating). In this case, the memory
is left untouched.

> - IPL Normal? "Obviously" it must stay otherwise kdump would never see
>   that memory.

Only diag308(0x3) will mess with virtio-mem memory. For the other types
of resets, it's left untouched. So yes, "obviously" is correct :)

> 
> And when you write it's up to the device driver what to do with that
> memory: is there any documentation available what all of this is good
> for? I would assume _most likely_ this extra memory is going to be
> added to ZONE_MOVABLE _somehow_ so that it can be taken away also. But
> since it is not normal memory, like you say, I'm wondering how that is
> supposed to work.

For now

1. virtio-mem adds all (possible) aligned memory via add_memory() to Linux
2. Requires user space to online the memory blocks / configure a zone.

For 2., only ZONE_NORMAL really works right now and is recommended to
use. As you correctly note, that does not give you any guarantees about
how much memory you can unplug again (e.g., fragmentation with unmovable
data), but is good enough for the first version (with focus on memory
hotplug, not unplug). ZONE_MOVABLE support is in the works.
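
(As a usage sketch for 2.: user space onlines a memory block by writing
to its sysfs state file - "online_movable" requests ZONE_MOVABLE,
"online" lets the kernel pick the zone; minimal C, error handling
trimmed:)

#include <stdio.h>

static int online_block(int nr, const char *how)
{
        char path[80];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/memory/memory%d/state", nr);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs(how, f);  /* e.g. "online" or "online_movable" */
        return fclose(f);
}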

However, we cannot blindly expose all memory to ZONE_MOVABLE (zone
imbalances leading to crashes), and sometimes also don't want to (e.g.,
gigantic pages). Without spoiling too much, a mixture would be nice.

> 
> As far as I can tell there would be a lot of inconsistencies in
> userspace interfaces which provide memory / zone information. Or I'm
> not getting the point of all of this at all.

All memory/zone stats are properly fixed up (similar to ballooning). The
only visible inconsistency that *might* happen when unplugging /
hotplugging memory in chunks smaller than 256MB on s390x is that the
number of memory block devices (/sys/devices/system/memory/...) might
indicate more memory than is actually available (e.g., via lsmem).


[1] https://lists.oasis-open.org/archives/virtio-comment/202006/msg00012.html
[2] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com/
[3] https://events19.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
[4] https://www.youtube.com/watch?v=H65FDUDPu9s
David Hildenbrand July 15, 2020, 5:51 p.m. UTC | #27
On 15.07.20 19:38, David Hildenbrand wrote:
> On 15.07.20 18:14, Heiko Carstens wrote:
>> On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
>>>> So, are you saying that even at IPL time there might already be memory
>>>> devices attached to the system? And the kernel should _not_ treat them
>>>> as normal memory?
>>>
>>> Sorry if that was unclear. Yes, we can have such devices (including
>>> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
>>> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
>>> charge of exposing that area and deciding what to do with it.
>>>
>>> The kernel should never treat them as normal memory (IOW, system RAM).
>>> Not during a cold boot, not during a reboot. The device driver is
>>> responsible for deciding how to use that memory (e.g., add it as system
>>> RAM), and which parts of that memory are actually valid to be used (even
>>> if a tprot might succeed it might not be valid to use just yet - I guess
>>> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
>>> want to use it like normal memory).
>>>
>>> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
>>> exposed via the e820 map. The only trace that there might be *something*
>>> now/in the future is indicated via ACPI SRAT tables. This currently
>>> takes care of indicating the maximum possible PFN.
>>
>> Ok, but all of this needs to be documented somewhere. This raises a
>> couple of questions to me:
> 
> I assume this mostly targets virtio-mem, because the semantics of
> virtio-mem provided memory are extra-weird (in contrast to rather static
> virtio-pmem, which is essentially just an emulated NVDIMM - a disk
> mapped into physical memory).
> 
> Regarding documentation (some linked in the cover letter), so far I have
> (generic/x86-64)
> 
> 1. https://virtio-mem.gitlab.io/
> 2. virtio spec proposal [1]
> 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
> 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
> 5. Linux cover letter [2]
> 6. KVM forum talk [3] [4]
> 
> As your questions go quite into technical detail, and I don't feel like
> rewriting the doc here :) , I suggest looking at [2], 1, and 5.

Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
comparison to memory ballooning (and DIMM-based memory hotplug).


> [3] https://events19.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
> [4] https://www.youtube.com/watch?v=H65FDUDPu9s
>
Heiko Carstens July 20, 2020, 2:43 p.m. UTC | #28
On Wed, Jul 15, 2020 at 07:51:27PM +0200, David Hildenbrand wrote:
> > Regarding documentation (some linked in the cover letter), so far I have
> > (generic/x86-64)
> > 
> > 1. https://virtio-mem.gitlab.io/
> > 2. virtio spec proposal [1]
> > 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
> > 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
> > 5. Linux cover letter [2]
> > 6. KVM forum talk [3] [4]
> > 
> > As your questions go quite into technical detail, and I don't feel like
> > rewriting the doc here :) , I suggest looking at [2], 1, and 5.
> 
> Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
> comparison to memory ballooning (and DIMM-based memory hotplug).

Ok, thanks for the pointers!

So I would go for what you suggested with option 2: provide a new
diagnose which tells the kernel where the memory device area is
(probably just start + size?), and leave all other interfaces alone.

This looks to me like by far the "cleanest" solution, which does not
add semantics to existing interfaces, where it is questionable whether
this wouldn't cause problems in the future.
David Hildenbrand July 20, 2020, 3:43 p.m. UTC | #29
On 20.07.20 16:43, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 07:51:27PM +0200, David Hildenbrand wrote:
>>> Regarding documentation (some linked in the cover letter), so far I have
>>> (generic/x86-64)
>>>
>>> 1. https://virtio-mem.gitlab.io/
>>> 2. virtio spec proposal [1]
>>> 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
>>> 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
>>> 5. Linux cover letter [2]
>>> 6. KVM forum talk [3] [4]
>>>
>>> As your questions go quite into technical detail, and I don't feel like
>>> rewriting the doc here :) , I suggest looking at [2], 1, and 5.
>>
>> Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
>> comparison to memory ballooning (and DIMM-based memory hotplug).
> 
> Ok, thanks for the pointers!

Thanks for having a look. Once the s390x part is in good shape, I'll add
proper documentation (+spec updates regarding exact system reset
handling on s390x).

> 
> So I would go for what you suggested with option 2: provide a new
> diagnose which tells the kernel where the memory device area is
> (probably just start + size?), and leave all other interfaces alone.

Ha, that's precisely what I hacked up earlier today :) Have a new
diag500 ("KVM hypercall") subcode (4) to give start+size of the area
reserved for memory devices. Will send a new RFC this week to showcase
how it would look.
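
(Very roughly, and purely illustrative - the subcode number, function
name and register conventions below are made up, the actual RFC may
differ:)

/* Hypothetical QEMU-side handler for such a subcode: return the region
 * reserved for memory devices, here start in gpr2 and size in gpr3. */
static void diag500_mem_device_region(CPUS390XState *env)
{
    MachineState *ms = MACHINE(qdev_get_machine());

    env->regs[2] = ms->ram_size;                    /* region starts above boot RAM */
    env->regs[3] = ms->maxram_size - ms->ram_size;  /* region size */
}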

> 
> This looks to me like by far the "cleanest" solution, which does not
> add semantics to existing interfaces, where it is questionable whether
> this wouldn't cause problems in the future.

Yes, same thoughts over here!
diff mbox series

Patch

diff --git a/target/s390x/diag.c b/target/s390x/diag.c
index 1a48429564..c3b1e24b2c 100644
--- a/target/s390x/diag.c
+++ b/target/s390x/diag.c
@@ -23,6 +23,63 @@ 
 #include "hw/s390x/pv.h"
 #include "kvm_s390x.h"
 
+void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    const ram_addr_t initial_ram_size = ms->ram_size;
+    const uint64_t subcode = env->regs[r3];
+    S390CPU *cpu = env_archcpu(env);
+    ram_addr_t addr, length;
+    uint64_t tmp;
+
+    /* TODO: Unlock with new QEMU machine. */
+    if (false) {
+        s390_program_interrupt(env, PGM_OPERATION, ra);
+        return;
+    }
+
+    /*
+     * There also seems to be subcode "0xc", which stores the size of the
+     * first chunk and the total size to r1/r2. It's only used by very old
+     * Linux, so don't implement it.
+     */
+    if ((r1 & 1) || subcode != 0x10) {
+        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
+        return;
+    }
+    addr = env->regs[r1];
+    length = env->regs[r1 + 1];
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
+        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
+        return;
+    }
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!length) {
+        setcc(cpu, 3);
+        return;
+    }
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!address_space_access_valid(&address_space_memory, addr, length, true,
+                                    MEMTXATTRS_UNSPECIFIED)) {
+        s390_program_interrupt(env, PGM_ADDRESSING, ra);
+        return;
+    }
+
+    /* Indicate our initial memory ([0 .. ram_size - 1]) */
+    tmp = cpu_to_be64(0);
+    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
+    tmp = cpu_to_be64(initial_ram_size - 1);
+    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
+
+    /* Exactly one entry was stored. */
+    env->regs[r3] = 1;
+    setcc(cpu, 0);
+}
+
 int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
 {
     uint64_t func = env->regs[r1];
diff --git a/target/s390x/internal.h b/target/s390x/internal.h
index b1e0ebf67f..a7a3df9a3b 100644
--- a/target/s390x/internal.h
+++ b/target/s390x/internal.h
@@ -372,6 +372,8 @@  int mmu_translate_real(CPUS390XState *env, target_ulong raddr, int rw,
 
 
 /* misc_helper.c */
+void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3,
+                     uintptr_t ra);
 int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3);
 void handle_diag_308(CPUS390XState *env, uint64_t r1, uint64_t r3,
                      uintptr_t ra);
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index f2f75d2a57..d6de3ad86c 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -1565,6 +1565,14 @@  static int handle_hypercall(S390CPU *cpu, struct kvm_run *run)
     return ret;
 }
 
+static void kvm_handle_diag_260(S390CPU *cpu, struct kvm_run *run)
+{
+    const uint64_t r1 = (run->s390_sieic.ipa & 0x00f0) >> 4;
+    const uint64_t r3 = run->s390_sieic.ipa & 0x000f;
+
+    handle_diag_260(&cpu->env, r1, r3, 0);
+}
+
 static void kvm_handle_diag_288(S390CPU *cpu, struct kvm_run *run)
 {
     uint64_t r1, r3;
@@ -1614,6 +1622,9 @@  static int handle_diag(S390CPU *cpu, struct kvm_run *run, uint32_t ipb)
      */
     func_code = decode_basedisp_rs(&cpu->env, ipb, NULL) & DIAG_KVM_CODE_MASK;
     switch (func_code) {
+    case 0x260:
+        kvm_handle_diag_260(cpu, run);
+        break;
     case DIAG_TIMEREVENT:
         kvm_handle_diag_288(cpu, run);
         break;
diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
index 58dbc023eb..d7274eb320 100644
--- a/target/s390x/misc_helper.c
+++ b/target/s390x/misc_helper.c
@@ -116,6 +116,12 @@  void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
     uint64_t r;
 
     switch (num) {
+    case 0x260:
+        qemu_mutex_lock_iothread();
+        handle_diag_260(env, r1, r3, GETPC());
+        qemu_mutex_unlock_iothread();
+        r = 0;
+        break;
     case 0x500:
         /* KVM hypercall */
         qemu_mutex_lock_iothread();
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index 4f6f1e31cd..6bb8b6e513 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -2398,6 +2398,10 @@  static DisasJumpType op_diag(DisasContext *s, DisasOps *o)
     TCGv_i32 func_code = tcg_const_i32(get_field(s, i2));
 
     gen_helper_diag(cpu_env, r1, r3, func_code);
+    /* Only some diags modify the CC. */
+    if (get_field(s, i2) == 0x260) {
+        set_cc_static(s);
+    }
 
     tcg_temp_free_i32(func_code);
     tcg_temp_free_i32(r3);