Message ID | ZBpe6btfLuuAS35g@memverge.com |
---|---|
State | New, archived |
Series | [RFC] cxl: Multi-headed device design |
On Tue, 21 Mar 2023 21:50:33 -0400 Gregory Price <gregory.price@memverge.com> wrote: Hi Gregory, Sorry I took so long to reply to this. Busy month... Vince presented at LSF-MM so I feel it's fair game to CC him kernel patches and he may be able to point you in right direction for a few things in this mail. > Originally I was planning to kick this off with a patch set, but i've > decided my current prototype does not fit the extensibility requirements > to go from SLD to MH-SLD to MH-MLD. > > > So instead I'd like to kick off by just discussing the data structures > and laugh/cry a bit about some of the frustrating ambiguities for MH-SLDs > when it comes to the specification. > > I apologize for the sheer length of this email, but it really is just > that complex. hehe. I read this far when you first sent it and decided to put it on the todo list rather than reading the rest ;) > > > ============================================================= > What does the specification say about Multi-headed Devices? > ============================================================= > > Defining each relevant component according to the specification: > > > > > VCS - Virtual CXL Switch > > * Includes entities within the physical switch belonging to a > > single VH. It is identified using the VCS ID. > > > > > > VH - Virtual Hierarchy. > > * Everything from the CXL RP down. > > > > > > LD - Logical Device > > * Entity that represents a CXL Endpoint that is bound to a VCS. > > An SLD device contains one LD. An MLD contains multiple LDs. > > > > > > SLD - Single Logical Device > > * That's it, that's the definition. > > > > > > MLD - Multi Logical Device > > * Multi-Logical Device. CXL component that contains multiple LDs, > > out of which one LD is reserved for configuration via the FM API, > > and each remaining LD is suitable for assignment to a different > > host. Currently MLDs are architected only for Type 3 LDs. 
> > > > > > MH-SLD - Mutli-Headed SLD > > * CXL component that contains multiple CXL ports, each presenting an > > SLD. The ports must correctly operate when connected to any > > combination of common or different hosts. > > > > > > MH-MLD - Multi-Headed MLD > > * CXL component that contains multiple CXL ports, each presenting an MLD > > or SLD. The ports must correctly operate when connected to any > > combination of common or different hosts. The FM-API is used to > > configure each LD as well as the overall MH-MLD. > > > > MH-MLDs are considered a specialized type of MLD and, as such, are > > subject to all functional and behavioral requirements of MLDs. > > > > Ambiguity #1: > > * An SLD contains 1 Logical Device. > * An MH-SLD presents multiple SLDs, one per head. > > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the > definition of LD, but not according to the definition of MLD, or MH-MLD. I'd go with 'sort of'. SLD is a presentation of a device to a host. It can be a normal single headed MLD that has been plugged directly into a host. So for extra fun points you can have one MH-MLD that has some ports connected to switches and other directly to hosts. Thus it can present as SLD on some upstream ports and as MLD on others. > > Now is the winter of my discontent. > > The Specification says this about MH-SLD's in other sections > > > 2.4.3 Pooled and Shared FAM > > > > LD-FAM includes several device variants. > > > > A multi-headed Single Logical Device (MH-SLD) exposes multiple LDs, each with > > a dedicated link. > > > > > > 2.5 Multi-Headed Device > > > > There are two types of Multi-Headed Devices that are distinguied by how > > they present themselves on each head: > > * MH-SLD, which present SLDs on all head > > * MH-MLD, which may present MLDs on any of their heads Yup. MH-SLD is the cheap device - not capable of MLD support to any upstream port - so it can skip some functionality. 
> > > > > > Management of heads in Multi-Headed Devices follows the model defined for > > the device presented by that head: > > * Heads that present SLDs may support the port management and control > > features that are available for SLDs > > * Heads that present MLDs may support the port management and control > > features that are available for MLDs > > > > I want to make very close note of this. SLD's are managed like SLDs > SLDs, MLDs are managed like MLDs. MH-SLDs, according to this, should be > managed like SLDs from the perspective of each host. True, but an MH-MLD device connected directly to a host will also be managed (at some level anyway) as an SLD on that particular port. > > That's pretty straight forward. > > > > > Management of memory resources in Multi-Headed Devices follows the model > > defined for MLD components because both MH-SLDs and MH-MLDs must support > > the isolation of memory resources, state, context, and management on a > > per-LD basis. LDs within the device are mapped to a single head. > > > > * In MH-SLDs, there is a 1:1 mapping between heads and LDs. > > * In MH-MLDs, multiple LDs are mapped to at most one head. > > > > > > Multi-Headed Devices expose a dedicated Component Command Interface (CCI), > > the LD Pool CCI, for management of all LDs within the device. The LD Pool > > CCI may be exposed as an MCTP-based CCI or can be accessed via the Tunnel > > Management Command command through a head’s Mailbox CCI, as detailed in > > Section 7.6.7.3.1. > > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores > that MH-SLDs (may) exist. That's frustrating to say the least, but I > suppose we can gather from context that MH-SLD's *MAY NOT* have LD > management controls. Hmm. In theory you could have an MH-SLD that used a config from flash or similar but that would be odd. We need some level of dynamic control to make these devices useful. 
Doesn't mean the spec should exclude dumb devices, but we shouldn't concentrate on them for emulation. One possible usecase would be a device that always shares all it's memory on all ports. Yuk. > > Lets see if that assumption holds. > > > 7.6.7.3 MLD Port Command Set > > > > 7.6.7.3.1 Tunnel Management Command (Opcode 5300h) > > The referenced section at the end of 2.5 seems to also suggest that > MH-SLDs do not (or don't have to?) implement the tunnel management > command set. It sends us to the MLD command set, and SLDs don't get > managed like MLDs - ergo it's not relevant? > > The final mention of MH-SLDs is mentioned in section 9.13.3 > > > 9.13.3 Dynamic Capacity Device > > ... > > MH-SLD or MH-MLD based DCD shall forcefully release shared Dynamic > > Capacity associated with all associated hosts upon a Conventional Reset > > of a head. > > > > From this we can gather that the specification foresaw someone making a > memory pool from an MH-SLD... but without LD management. We can fill in > some blanks and assume that if someone wanted to, they could make a > shared memory device and implement pooling via software controls. When you say software controls? I'm not sure I follow. > > That'd be a neat bodge/hack. But that's not important right now. > Fair enough. Moving on. > > Finally, we look at what the mailbox command-set requirements are for > multi-headed devices: > > > 7.6.7.5 Multi-Headed Device Command Set > > The Multi-Headed device command set includes commands for querying the > > Head-to-LD mapping in a Multi-Headed device. Support for this command > > set is required on the LD Pool CCI of a Multi-Headed device. > > > > Ambiguity #2: Ok, now we're not sure whether an MH-SLD is supposed to > expose an LD Pool CCI or not. Also, is a MH-SLD supposed to show up > as something other than just an SLD? This is really confusing. 
>
> Going back to the MLD Port Command set, we see
>
> > Valid targets for the tunneled commands include switch MLD Ports,
> > valid LDs within an MLD, and the LD Pool CCI in a Multi-Headed device.
>
> Whatever the case, there's only a single command in the MHD command set:
>
> > 7.6.7.5.1 Get Multi-Headed Info (Opcode 5500h)
>
> This command is pretty straightforward, it just tells you what the head
> to LD mapping is for each of the LDs in the device. Presumably this is
> what gets modified by the FM-APIs when LDs are attached to VCS ports.
>
> For the simplest MH-SLD device, these fields would be immutable, and
> there would be a single LD for each head, where head_id == ld_id.

Agreed.

>
>
>
> So summarizing, what I took away from this was the following:
>
> In the simplest form of MH-SLD, there is neither a switch, nor is
> there LD management. So, presumably, we don't HAVE to implement the
> MHD commands to say we "have MH-SLD support".

Whilst theoretically possible - I don't think such a device is interesting. Minimum I'd want to see is something with multiple upstream SLD ports and a management LD with appropriate interface to poke it.

The MLD side of things is interesting only once we support MLDs in general in QEMU CXL emulation and even then they are near invisible to a host and are more interesting for emulating fabric management.

What you may want to do is take Fan's work on DCD and look at doing a simple MH-SLD device that uses the same cheat of just using QMP commands to do the configuration. That's an intermediate step to us getting the FM-API and similar commands implemented.

>
>
> ========
> Design
> ========
>
> Ok... that's a lot to break down. Here's what I think the roadmap
> toward multi-headed multi-logical device support should look like:
>
> 1. SLD - we have this. This is struct CXLType3Dev

We could look at Switch + MLD after this, but lots of work to get the FM-API stuff in place that makes that interesting.
The advantage being we'd have the ability to move LDs around that I think you are interested in.

>
> 2. MH-SLD No Switch, No Pool CCI.

I'd fiddle that a little. To be useful it needs the functionality that a pool CCI provides - something to change the configuration, but that can be impdef - (QMP stuff like Fan Ni did for DCD). I'm not sure we want to upstream the QMP side of things but it gives a path to start messing around with this quicker.

>
> 3. MH-SLD w/ Pool CCI (Implementing only Get Multi-Headed Info)

I'd do this + DCD.

>
> 4. MH-SLD w/ Switch (Implementing remap-ability of LD to Head)

Hmm. You want this for migration I guess. I'd be tempted to jump directly to DCD. I'm not even sure if the spec really allows this sort of remapping without a switch / MHD because DCD covers that gap.

>
> 5. MH-MLD - the whole kit and kaboodle.
>
>
> Let's talk about what the first MH-SLD might look like.
>
>
> =================================
> 2. MH-SLD No Switch, No Pool CCI.
> =================================
>
> 1. The device has a "memory pool" that "backs" each Logical Device, and
>    the specification does not limit whether this memory is discrete
>    or may be shared between heads.
>
>    In QEMU, we can represent this with a shared or file memory backend:
>
>    -object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true
>
>
> 2. Each QEMU instance has a discrete SLD that amounts to its own private
>    CXLType3Dev. However, each "Head" maps back to the same common
>    memory backend:
>
>    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0
>
>
> And that's it. In fact, you can do this now, no changes needed!
>
>
> But it's also not very useful. You can only use the memory in devdax
> mode, since it's a shared memory region. You could already do this via
> the /dev/shm interface, so it's not even new functionality.
>
> In theory you could build a pooling service in software-only on top of
> memory blocks. That's an exercise left to the reader.

Yeah.
Let's not do this step. > > > ================================================================ > 3. MH-SLD w/ Pool CCI (Implementing only Get Multi-Headed Info) > ================================================================ > > This is a little more complicated, we have our first bit of shared > state. Originally I had considered a shared memory region in > CXLType3Dev, but this is a backwards abstraction (A MH-SLD contains > mutliple SLDs, an SLD does not contain an MHD State). > > diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h > index 7b72345079..1a9f2708e1 100644 > --- a/include/hw/cxl/cxl_device.h > +++ b/include/hw/cxl/cxl_device.h > @@ -356,16 +356,6 @@ typedef struct CXLPoison { > typedef QLIST_HEAD(, CXLPoison) CXLPoisonList; > #define CXL_POISON_LIST_LIMIT 256 > > +struct CXLMHDState { > + uint8_t nr_heads; > + uint8_t nr_lds; > + uint8_t ldmap[]; > +}; > + > struct CXLType3Dev { > /* Private */ > PCIDevice parent_obj; > @@ -377,15 +367,6 @@ struct CXLType3Dev { > HostMemoryBackend *lsa; > uint64_t sn; > > + > + /* Multi-headed device settings */ > + struct { > + bool active; > + uint32_t headid; > + uint32_t shmid; > + struct CXLMHDState *state; > + } mhd; > + > > > The way you would instantiate this would be a via a separate process > that initializes the shared memory region: > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1` > ./cxl_mhd_init 4 $shmid1 > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1 > > ./cxl_mhd_init would simply setup the nr_heads/lds field and such > and set ldmap[0-3] to the values [0-3]. i.e. the head-to-ld mappings > are static (head_id==ld_id). 
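
As an illustration of the static head-to-LD layout described above, here is a minimal C sketch of what an init helper such as `./cxl_mhd_init` might do to the shared region. It is not the prototype's code: the helper names (`mhd_state_size`, `mhd_state_init`, `mhd_head_for_ld`) are hypothetical, `ldmap` is assumed to be indexed by LD ID, and the buffer is assumed to be already mapped (the thread uses a SysV shared memory segment; ordinary heap memory works just as well for trying it out).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same layout as the CXLMHDState struct proposed in the quoted patch. */
struct CXLMHDState {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[];   /* assumption: ldmap[ld_id] = head that owns the LD */
};

/* Bytes needed for the header plus one map entry per LD (hypothetical helper). */
size_t mhd_state_size(uint8_t nr_lds)
{
    return sizeof(struct CXLMHDState) + (size_t)nr_lds;
}

/* Fill an already-mapped region with the static MH-SLD layout. */
void mhd_state_init(struct CXLMHDState *s, uint8_t nr_heads)
{
    assert(s != NULL && nr_heads > 0);
    s->nr_heads = nr_heads;
    s->nr_lds = nr_heads;            /* MH-SLD: exactly one LD per head */
    for (uint8_t i = 0; i < nr_heads; i++) {
        s->ldmap[i] = i;             /* head_id == ld_id */
    }
}

/* Return the head that owns ld_id, or -1 if ld_id is out of range. */
int mhd_head_for_ld(const struct CXLMHDState *s, uint8_t ld_id)
{
    return ld_id < s->nr_lds ? s->ldmap[ld_id] : -1;
}
```

For a 4-head device the region needs `mhd_state_size(4)` bytes, which fits comfortably in the 4096-byte segment created by `ipcmk -M 4096` in the example above.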
> > > > But like I said, this is a backwards abstraction, so realistically we > should flip this around such that we have the following: > > struct CXLMHD_SharedState { > uint8_t nr_heads; > uint8_t nr_lds; > uint8_t ldmap[]; > }; > > struct CXLMH_SLD { > uint32_t headid; > uint32_t shmid; > struct CXLMHD_SharedState *state; > struct CXLType3Dev sld; > }; > > The shared state would be instantiated the same way as above. > > With this we'd basically just create a new memory device: > > hw/mem/cxl_mh_sld.c > > > This is pretty straightforward - we just expose some of cxl_type3.c > functions in order to instantiate the device accordingly, the rest of it > just becomes passthrough because... it's just a cxl_type3.c device. > > > This ultimately manifests as: > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1` > > ./cxl_mhd_init 4 $shmid1 > > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid > > > Note: This is the patch set i'm working towards, but I presume there > might be some (strong) opinions, so i didn't want to get too far into > development before posting this. Key here is that what is actually interesting is MH-SLD with Dynamic Capacity, not just sharing the whole mapped memory. That gives us the flexibility to move memory between heads. A few different moving parts are needed and I think we'd end up with something that looks like -device cxl-mhd,volatile-memdev=mem0,id=backend -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2 dev1 provides the tunneling interface, but the actual implementation of the pool CCI and actual memory mappings is in the backend. Note that backend might be proxy to an external process, or a client/server approach between multiple QEMU instances. The Pool CCI is accessed via tunnel from dev1 and can both query everything about the two heads and also perform DCD capacity add / release on the LDs. 
That can potentially include shared capacity and all the other bells and whistles we get doing DCD on an MLD device.

Or squish some parts and make a more extensible type3 device and have:

-device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
-device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2

Possibly adding socket numbers as options if we are doing multi qemu support (can do that later I think as long as we've thought about how to do the command line).

>
>
> ==============================================================
> 4. MH-SLD w/ Switch (Implementing LD management in an SLD)
> ==============================================================
>
> Is it even rational to try to build such a device?
>
> MH-SLDs have a 1-to-1 mapping of Head:Logical Device.
>
> Presumably each SLD maps the entirety of the "pooled" memory,
> but the specification does not state that is true. You could, for
> example, setup each Logical Device to map to a particular portion of the
> shared/pooled memory area:

DCD is again key here. You can't move LDs around on an MH-SLD, but you can move capacity around between them using DCD.

>
> -object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true
>
> QEMU #1
> -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid,dpa_base=0,dpa_limit=1G
>
> QEMU #2
> -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid,dpa_base=1G,dpa_limit=1G
>
> ... and so on.
>
> At least in theory, this would involve implementing something that
> changes which SLD is mapped to a QEMU instance - but functionally this
> is just changing the base and limit of each SLD.
>
> It's interesting from a functional testing perspective, there's a bunch
> of CCI/Tunnel commands that could be implemented, and presumably this
> would require a separate process to manage/serialize appropriately.
>

If this is interesting, do a normal MLD and switch first.
The MHD case is something to stack on top of that.

> =======================================
> 5. MH-MLD - the whole kit and kaboodle.
> =======================================
>
> If we implemented MH-SLD w/ Switching, then presumably it's just one step
> further to create an MLD:
>
> struct CXLMH_MLD {
>     uint32_t headid;
>     uint32_t shmid;
>     struct CXLMHD_SharedState *state;
>     struct CXLType3Dev ldmap[];
> };
>
> But i'm greatly oversimplifying here. It's much more expressive to
> describe an MLD in terms of a multi-tiered switch in the QEMU topology,
> similar to what can be done right now:
>
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12 \
> -device cxl-rp,id=rp0,port=0,bus=cxl.0,chassis=0,slot=0 \
> -device cxl-rp,id=rp1,port=1,bus=cxl.0,chassis=0,slot=1 \
> -device cxl-upstream,bus=rp0,id=us0 \
> -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
> -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
> -device cxl-type3,bus=swport0,volatile-memdev=mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
>
>
> But in order to make this multi-headed, some amount of this state would need
> to be encapsulated in a shared memory region (or would it? I don't know, i
> haven't finished this thought experiment yet).

Someone (wherever the LD pool CCI is) needs to hold shared state. Lots of options for that.

>
>
> =====
>  FIN
> =====
>
> I realize this was long. If you made it to the end of this email,
> thank you for reading my TED talk. I greatly appreciate any comments,
> even if it's just "You've gone too deep, Gregory." ;]

:) You've only just got started. This goes much deeper!

>
> Regards,
> ~Gregory

To my mind there are a series of steps and questions here.

Which 'hotplug model'.
1) LD model for moving capacity
- If doing LD model, do MLDs and configurable switches first. Needed as a step along the path anyway.
Deal with all the mess that brings and come back to MHD - as you note it only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.

2) DCD model for moving capacity
- MH-SLD with a pool CCI used to do DCD operations on the LDs. Extension of what Fan Ni is looking at. He's making an SLD pretend to be a device where DCD makes sense - whilst still using the CXL type 3 device. We probably shouldn't do that without figuring out how to do an MH-SLD - or at least a head that we intend to hang this new stuff off - potentially just using the existing type 3 device with more parameters as one of the MH-SLD heads that doesn't have the control interface and different parameters if it does have the tunnel to the Pool CCI.

Implementing MCTP CCI. Probably a later step, but need to think what that looks like. I'm thinking we proxy it through to wherever the pool CCI ends up. Should be easy enough if a little ugly.

So the question is whether it's worth a highly modular design, or we just keep tacking flexibility onto existing Type 3 device emulation. These are all type 3 devices after all ;)

Lots of fun details to hammer out.

Jonathan
On Mon, May 15, 2023 at 05:18:07PM +0100, Jonathan Cameron wrote: > On Tue, 21 Mar 2023 21:50:33 -0400 > Gregory Price <gregory.price@memverge.com> wrote: > > > > > Ambiguity #1: > > > > * An SLD contains 1 Logical Device. > > * An MH-SLD presents multiple SLDs, one per head. > > > > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the > > definition of LD, but not according to the definition of MLD, or MH-MLD. > > I'd go with 'sort of'. SLD is a presentation of a device to a host. > It can be a normal single headed MLD that has been plugged directly into a host. > > So for extra fun points you can have one MH-MLD that has some ports connected > to switches and other directly to hosts. Thus it can present as SLD on some > upstream ports and as MLD on others. > I suppose this section of the email was really to just point out that what constitutions a "multi-headed", "logical", and "multi-logical" device is rather confusing from just reading the spec. Since writing this, i've kind of settled on: MH-* - anything with multiple heads, regardless of how it works SLD - one LD per head, but LD does not imply any particular command set MLD - multiple LD's per head, but those LD's may only attach to one head DCD - anything can technically be a DCD if it implements the commands Trying to figure out, from the spec, "what commands an MH-SLD" should implement to be "Spec Compliance" was my frustration. It's somewhat clear now that the answer is "Technically nothing... unless its an MLD". > > I want to make very close note of this. SLD's are managed like SLDs > > SLDs, MLDs are managed like MLDs. MH-SLDs, according to this, should be > > managed like SLDs from the perspective of each host. > > True, but an MH-MLD device connected directly to a host will also > be managed (at some level anyway) as an SLD on that particular port. > The ambiguous part is... what commands relate specifically to an SLD? 
The spec isn't really written that way, and the answer is that an SLD is more of a lack of other functionality (specifically MLD functionality), rather than its own set of functionality. i.e. an SLD does not require an FM-Owned LD for management, but an MHD, MLD, and DCD all do (at least in theory). > > > > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores > > that MH-SLDs (may) exist. That's frustrating to say the least, but I > > suppose we can gather from context that MH-SLD's *MAY NOT* have LD > > management controls. > > Hmm. In theory you could have an MH-SLD that used a config from flash or similar > but that would be odd. We need some level of dynamic control to make these > devices useful. Doesn't mean the spec should exclude dumb devices, but > we shouldn't concentrate on them for emulation. > > One possible usecase would be a device that always shares all it's memory on > all ports. Yuk. > I can say that the earliest forms of MH-SLD, and certainly pre-DCD, is likely to present all memory on all ports, and potentially provide some custom commands to help hosts enforce exclusivity. It's beyond the spec, but this can actually be emulated today with the MH-SLD setup I describe below. Certainly I expected a yuk factor to proposing it, but I think the reality is on the path to 3.0 and DCD devices we should at least entertain that someone will probably do this with real hardware. > > For the simplest MH-SLD device, these fields would be immutable, and > > there would be a single LD for each head, where head_id == ld_id. > > Agreed. > > > > > So summarizing, what I took away from this was the following: > > > > In the simplest form of MH-SLD, there's is neither a switch, nor is > > there LD management. So, presumably, we don't HAVE to implement the > > MHD commands to say we "have MH-SLD support". > > Whilst theoretically possible - I don' think such a device is interesting. 
> Minimum I'd want to see is something with multiple upstream SLD ports > and a management LD with appropriate interface to poke it. > > > The MLD side of things is interesting only once we support MLDs in general > in QEMU CXL emulation and even then they are near invisible to a host > and are more interesting for emulating fabric management. > > What you may want to do is take Fan's work on DCD and look at doing > a simple MH-SLD device that uses same cheat of just using QMP commands > to do the configuration. That's an intermediate step to us getting > the FM-API and similar commands implemented. > I actually think it's a good step to go from MH-SLD to MH-SLD+DCD while not having to worry about the complexity of MLD and switches. (I have not gotten the chance to review the DCD patch set yet, it's on my list for after ISC'23, I presume this is what has been done). My thoughts would be that you would have something like the following: -device ct3d,... etc etc -device cxl-dcd,type3-backend=mem0,manager=true the manager would be the owner of the FM-Owned LD, and therefore the system responsible for managing requests for memory. How we pass those messages between instances is then an exercise for the reader. What I have been doing is just creating a shared memory region with mkipc and using a separate program to initiate that shared state before launching the guests. I'll talk about this a little further down. > > > > ... snip ... > > > > 3. MH-SLD w/ Pool CCI (Implementing only Get Multi-Headed Info) > > I'd do this + DCD. > I concur, and it's what i was looking into next. I think your other notes on MH-* with switches is really where I was left scratching my head. When I look at Switch/MLD functionality vs DCD, I have a gut feeling the vast majority of early device vendors are going to skip right over switches and MLD setups and go directly to MH-SLD+DCD. > > ================================= > > 2. MH-SLD No Switch, No Pool CCI. 
> > ================================= > > > > But it's also not very useful. You can only use the memory in devdax > > mode, since it's a shared memory region. You could already do this via > > the /dev/shm interface, so it's not even new functionality. > > > > In theory you could build a pooling service in software-only on top of > > memory blocks. That's an exercise left to the reader. > > Yeah. Let's not do this step. > To late :]. It was useful as a learning exercise, but it's definitely not upstream quality. I may post it for the sake of the playground, but I too would recommend against this method of pooling in the long term. I made a proto-DCD command set that was reachable from each memdev character device, and exposed it to every qemu instance as part of ct3d (I'm still learning the QEMU ecosystem, so was easier to bodge it in than make a new device and link it up). Then I created a shared memory region with mkipc, and implemented a simple mutex in the space, as well as all the record keeping needed to manage sections/extents. > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1` > > ./cxl_mhd_init 4 $shmid1 > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1 > > > > ./cxl_mhd_init would simply setup the nr_heads/lds field and such > > and set ldmap[0-3] to the values [0-3]. i.e. the head-to-ld mappings > > are static (head_id==ld_id). > > ... snip ... > > > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1` > > ./cxl_mhd_init 4 $shmid1 > > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid The last step was a few extra lines in the read/write functions to ensure accesses to "Valid addresses" that "Aren't allocated" produce errors. At this point, each guest is capable basically using the device to do the coordination for you by simply calling the allocate/deallocate functions. And that's it, you've got pooling. 
Each guest sees the full extent of the entire device, but must ask the device for access to a given section, and the section can be translated into a memory block number under the given numa node.

Ok, now let's talk about why this is bad and why you shouldn't do it this way:

* Technically a number of bios/hardware interleave features can bite you pretty hard when making the assumption that memory blocks are physically contiguous hardware addresses. However, that assumption holds if you simply don't turn those options on, so it might be useful as an early-adopter platform.

* The security posture of a device like this is bad. It requires each attached host to clear the memory before releasing it. There isn't really a good way to do this in numa-mode, so you would have to implement custom firmware commands to ensure it happens, and that means custom drivers blah blah blah - not great. Basically you're trusting each host to play nice. Not great. But potentially useful for early adopters regardless.

* General compatibility and being in-spec - this design requires a number of non-spec extensions, so just generally not recommended, certainly not here in QEMU.

>
> A few different moving parts are needed and I think we'd end up with something that
> looks like
>
> -device cxl-mhd,volatile-memdev=mem0,id=backend
> -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
> -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2
>
> dev1 provides the tunneling interface, but the actual implementation of
> the pool CCI and actual memory mappings is in the backend. Note that backend
> might be proxy to an external process, or a client/server approach between multiple
> QEMU instances.

I've hummed and hawed over external process vs another QEMU instance and I still haven't come to a satisfying answer here.
It feels extremely heavy-handed to use an entirely separate QEMU instance just for this, but there's nothing to say you can't just host it in one of the head-attached instances.

I basically skipped this and allowed each instance to send the command themselves, but serialized it with a mutex. That way each instance can operate cleanly without directly coordinating with each other. I could see a vendor implementing it this way on early devices.

I don't have a good answer for this yet, but maybe once I review the DCD patch set I'll have more opinions.

>
> or squish some parts and make a more extensible type3 device and have.
>
> -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
> -device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2
>

I originally went this route, but the downside of this is "What happens when the main dies and has to restart". There's all kinds of badness associated with that. It's why i moved the shared state into a separately created ipcmk region.

>
> To my mind there are a series of steps and questions here.
>
> Which 'hotplug model'.
> 1) LD model for moving capacity
> - If doing LD model, do MLDs and configurable switches first. Needed as a step along the
> path anyway. Deal with all the mess that brings and come back to MHD - as you note it
> only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.
>
> 2) DCD model for moving capacity
> - MH-SLD with a pool CCI used to do DCD operations on the LDs. Extension of
> what Fan Ni is looking at. He's making an SLD pretend to be a device
> where DCD makes sense - whilst still using the CXL type 3 device.
We probably shouldn't
> do that without figuring out how to do an MH-SLD - or at least a head that we intend
> to hang this new stuff off - potentially just using the existing type 3 device with
> more parameters as one of the MH-SLD heads that doesn't have the control interface and
> different parameters if it does have the tunnel to the Pool CCI.
>

Personally I think we should focus on the DCD model. In fact, I think we're already very close to this, as my personal prototype showed this can work fairly cleanly, and I imagine I'll have a quick MHD patch set once I get the chance to review the DCD patch set.

If I'm being honest, the more I look at the LD model, the less I like it, but I understand that's how scale is going to be achieved. I don't know if focusing on that design right now is going to produce adoption in the short term, since we're not likely to see those devices for a few years.

MH-SLD+DCD is likely to show up much sooner, so I will target that.

~Gregory
On Tue, 16 May 2023 02:20:07 -0400 Gregory Price <gregory.price@memverge.com> wrote:

> On Mon, May 15, 2023 at 05:18:07PM +0100, Jonathan Cameron wrote:
> > On Tue, 21 Mar 2023 21:50:33 -0400
> > Gregory Price <gregory.price@memverge.com> wrote:
> >
> > > Ambiguity #1:
> > >
> > > * An SLD contains 1 Logical Device.
> > > * An MH-SLD presents multiple SLDs, one per head.
> > >
> > > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
> > > definition of LD, but not according to the definition of MLD, or MH-MLD.
> >
> > I'd go with 'sort of'. SLD is a presentation of a device to a host.
> > It can be a normal single headed MLD that has been plugged directly into a host.
> >
> > So for extra fun points you can have one MH-MLD that has some ports connected
> > to switches and others directly to hosts. Thus it can present as SLD on some
> > upstream ports and as MLD on others.
> >
>
> I suppose this section of the email was really to just point out that
> what constitutes a "multi-headed", "logical", and "multi-logical"
> device is rather confusing from just reading the spec. Since writing
> this, I've kind of settled on:
>
> MH-* - anything with multiple heads, regardless of how it works
> SLD  - one LD per head, but LD does not imply any particular command set
> MLD  - multiple LDs per head, but those LDs may only attach to one head
> DCD  - anything can technically be a DCD if it implements the commands
>
> Trying to figure out, from the spec, what commands an MH-SLD should
> implement to be "spec compliant" was my frustration. It's somewhat
> clear now that the answer is "Technically nothing... unless it's an MLD".

Sounds about right :) Some of this is intentional - it's a grab bag of
features and options, not a nice clean definition of 'the right set to
implement'. Market should probably drive that. I think the expectation is
that de facto feature set standards will happen - but outside of the CXL
spec.
>
> > > I want to make very close note of this. SLDs are managed like
> > > SLDs, MLDs are managed like MLDs. MH-SLDs, according to this, should be
> > > managed like SLDs from the perspective of each host.
> >
> > True, but an MH-MLD device connected directly to a host will also
> > be managed (at some level anyway) as an SLD on that particular port.
> >
>
> The ambiguous part is... what commands relate specifically to an SLD?
> The spec isn't really written that way, and the answer is that an SLD is
> more of a lack of other functionality (specifically MLD functionality),
> rather than its own set of functionality.

Yup.

>
> i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> MLD, and DCD all do (at least in theory).

DCD 'might', though I don't think anything in the spec rules that you 'must'
control the SLD/MLD via the FM-API, it's just a spec provided option.
From our point of view we don't want to get more creative so let's assume
it does.

I can't immediately think of a reason for a single head SLD to have an FM
owned LD, though it may well have an MCTP CCI for querying stuff about it
from an FM.

> > >
> > > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
> > > that MH-SLDs (may) exist. That's frustrating to say the least, but I
> > > suppose we can gather from context that MH-SLDs *MAY NOT* have LD
> > > management controls.
> >
> > Hmm. In theory you could have an MH-SLD that used a config from flash or similar
> > but that would be odd. We need some level of dynamic control to make these
> > devices useful. Doesn't mean the spec should exclude dumb devices, but
> > we shouldn't concentrate on them for emulation.
> >
> > One possible use case would be a device that always shares all its memory on
> > all ports. Yuk.
> >
>
> I can say that the earliest forms of MH-SLD, and certainly pre-DCD, are
> likely to present all memory on all ports, and potentially provide some
> custom commands to help hosts enforce exclusivity.
>
> It's beyond the spec, but this can actually be emulated today with the
> MH-SLD setup I describe below. Certainly I expected a yuk factor to
> proposing it, but I think the reality is on the path to 3.0 and DCD
> devices we should at least entertain that someone will probably do this
> with real hardware.

From the point of view of the spec, what you describe is an MH-SLD in which
all the memory is shared - non coherent. That's a valid choice - albeit a
much nastier option than either DCD based or sharing with coherency.
It might fall out as an option in a flexibly defined MLD, but I'm not
particularly interested in that case (don't mind if you are though!)

> > >
> > > For the simplest MH-SLD device, these fields would be immutable, and
> > > there would be a single LD for each head, where head_id == ld_id.
> >
> > Agreed.
> >
> > >
> > > So summarizing, what I took away from this was the following:
> > >
> > > In the simplest form of MH-SLD, there is neither a switch, nor is
> > > there LD management. So, presumably, we don't HAVE to implement the
> > > MHD commands to say we "have MH-SLD support".
> >
> > Whilst theoretically possible - I don't think such a device is interesting.
> > Minimum I'd want to see is something with multiple upstream SLD ports
> > and a management LD with appropriate interface to poke it.
> >
> > The MLD side of things is interesting only once we support MLDs in general
> > in QEMU CXL emulation and even then they are near invisible to a host
> > and are more interesting for emulating fabric management.
> >
> > What you may want to do is take Fan's work on DCD and look at doing
> > a simple MH-SLD device that uses the same cheat of just using QMP commands
> > to do the configuration.
> > That's an intermediate step to us getting
> > the FM-API and similar commands implemented.
> >
>
> I actually think it's a good step to go from MH-SLD to MH-SLD+DCD while
> not having to worry about the complexity of MLD and switches.

Maybe, there are other flows that only work for MLD and switches that
aren't MHD related (hotplug basically) so those might get explored in
parallel.

>
> (I have not gotten the chance to review the DCD patch set yet, it's on
> my list for after ISC'23, I presume this is what has been done).

At the moment it's just an SLD with DCD presentation to the host. Nothing
on the control side.

>
> My thoughts would be that you would have something like the following:
>
> -device ct3d,... etc etc
> -device cxl-dcd,type3-backend=mem0,manager=true

DCD is just an aspect of a type 3 device. I'm fine with a manager element,
but don't call it cxl-dcd.

>
> the manager would be the owner of the FM-Owned LD, and therefore the
> system responsible for managing requests for memory.
>
> How we pass those messages between instances is then an exercise for the
> reader.
>
> What I have been doing is just creating a shared memory region with
> ipcmk and using a separate program to initialize that shared state before
> launching the guests. I'll talk about this a little further down.
>
> > >
> > > ... snip ...
> > >
> > > 3. MH-SLD w/ Pool CCI (Implementing only Get Multi-Headed Info)
> >
> > I'd do this + DCD.
> >
>
> I concur, and it's what I was looking into next.
>
> I think your other notes on MH-* with switches are really where I was
> left scratching my head.
>
> When I look at Switch/MLD functionality vs DCD, I have a gut feeling the
> vast majority of early device vendors are going to skip right over
> switches and MLD setups and go directly to MH-SLD+DCD.

Yup. That's likely - though probably more driven by switch latency concerns
than by complexity. MLDs aren't too bad, and the DCD parts etc. are the
same as for SLD.
>
> > > =================================
> > > 2. MH-SLD No Switch, No Pool CCI.
> > > =================================
> > >
> > > But it's also not very useful. You can only use the memory in devdax
> > > mode, since it's a shared memory region. You could already do this via
> > > the /dev/shm interface, so it's not even new functionality.
> > >
> > > In theory you could build a pooling service in software-only on top of
> > > memory blocks. That's an exercise left to the reader.
> >
> > Yeah. Let's not do this step.
> >
>
> Too late :]. It was useful as a learning exercise, but it's definitely
> not upstream quality. I may post it for the sake of the playground, but
> I too would recommend against this method of pooling in the long term.
>
> I made a proto-DCD command set that was reachable from each memdev
> character device, and exposed it to every qemu instance as part of ct3d
> (I'm still learning the QEMU ecosystem, so it was easier to bodge it in
> than make a new device and link it up).
>
> Then I created a shared memory region with ipcmk, and implemented a
> simple mutex in the space, as well as all the record keeping needed to
> manage sections/extents.
>
> > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > > ./cxl_mhd_init 4 $shmid1
> > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1
> > >
> > > ./cxl_mhd_init would simply set up the nr_heads/lds field and such
> > > and set ldmap[0-3] to the values [0-3]. i.e. the head-to-ld mappings
> > > are static (head_id==ld_id).
> > > ... snip ...
> > >
> > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > > ./cxl_mhd_init 4 $shmid1
> > > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid
>
> The last step was a few extra lines in the read/write functions to
> ensure accesses to "Valid addresses" that "Aren't allocated" produce
> errors.
>
> At this point, each guest is basically capable of using the device to do
> the coordination for you by simply calling the allocate/deallocate
> functions.
>
> And that's it, you've got pooling. Each guest sees the full extent of
> the entire device, but must ask the device for access to a given
> section, and the section can be translated into a memory block number
> under the given numa node.
>

This is a valid model, but I'd do the management out of band.
We could add BI support though if that's really useful
(I don't think I can go into why non coherency is a problem in some real
hardware for this use case... watch this space)

>
> Ok, now let's talk about why this is bad and why you shouldn't do it
> this way:
>
> * Technically a number of BIOS/hardware interleave features can
>   bite you pretty hard when making the assumption that memory blocks are
>   physically contiguous hardware addresses. However, that assumption
>   holds if you simply don't turn those options on, so it might be useful
>   as an early-adopter platform.

I'd never enumerate these from BIOS - doesn't make sense for something
there for dynamic runtime allocation.

Interleave is indeed hard - don't do it (yet)

>
> * The security posture of a device like this is bad. It requires each
>   attached host to clear the memory before releasing it. There isn't
>   really a good way to do this in numa-mode, so you would have to
>   implement custom firmware commands to ensure it happens, and that
>   means custom drivers blah blah blah - not great.
>
>   Basically you're trusting each host to play nice. Not great.
>   But potentially useful for early adopters regardless.

Agreed. I'd put it down as horrible - no one should build this ;)

>
> * General compatibility and being in-spec - this design requires a
>   number of non-spec extensions, so just generally not recommended,
>   certainly not here in QEMU.

Hmm. Does it?
Looks to me like it would be presented to hosts as multiple SLDs with
volatile regions and a CDAT that presents DSMAS with flags for
Shareable / !Hardware managed coherency (or wire up the missing bits of
BI enablement in QEMU - doesn't actually do anything but we should provide
the various registers and correctly enable it in the kernel)

The fact it's an MHD isn't visible to the hosts, so I think this is spec
compliant if odd.

If you meant the control path, that is also fine as long as you just do it
through memory and a 'convention' on the software side for where that
shared set of info is + some fun algorithms to deal with mutex etc. Needs
BI support though.

> >
> > A few different moving parts are needed and I think we'd end up with something that
> > looks like
> >
> > -device cxl-mhd,volatile-memdev=mem0,id=backend
> > -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
> > -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2
> >
> > dev1 provides the tunneling interface, but the actual implementation of
> > the pool CCI and actual memory mappings is in the backend. Note that backend
> > might be proxy to an external process, or a client/server approach between multiple
> > QEMU instances.
>
> I've hummed and hawed over external process vs another QEMU instance and I
> still haven't come to a satisfying answer here. It feels extremely
> heavy-handed to use an entirely separate QEMU instance just for this,
> but there's nothing to say you can't just host it in one of the
> head-attached instances.

MHD is only really interesting (for hardware coherent sharing anyway)
if you have multiple host OSes, so that's multiple QEMU instances.
If there is a 'main' instance of QEMU then everything should still work
though, so we can leave the subordinate instances for future work.

>
> I basically skipped this and allowed each instance to send the command
> themselves, but serialized it with a mutex.
> That way each instance can
> operate cleanly without directly coordinating with each other. I could
> see a vendor implementing it this way on early devices.

That control would have to be out of band if using memory, or require BI.
BI requires a CXL 3.0 host, whereas a DCD based MHD (sure, defined in
CXL 3.0) would work with a CXL 2.0 host - well, probably even a CXL 1.1
host, but who wants to bother with those...

I suspect we'll see impdef versions. DCD is at heart pretty simple though,
so I'd expect it to turn up fairly fast after the first memory pool devices.

>
> I don't have a good answer for this yet, but maybe once I review the DCD
> patch set I'll have more opinions.
>
> >
> > or squish some parts and make a more extensible type3 device and have.
> >
> > -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
> > -device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2
> >
>
> I originally went this route, but the downside of this is "What happens
> when the main dies and has to restart". There are all kinds of
> badness associated with that. It's why I moved the shared state into a
> separately created ipcmk region.

Modelling. I don't care if that happens :) You are right, however, that
it would need more care. External process probably makes sense - can be
pretty lightweight.

> >
> > To my mind there are a series of steps and questions here.
> >
> > Which 'hotplug model'.
> > 1) LD model for moving capacity
> >    - If doing LD model, do MLDs and configurable switches first. Needed as a step along the
> >      path anyway. Deal with all the mess that brings and come back to MHD - as you note it
> >      only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.
> >
> > 2) DCD model for moving capacity
> >    - MH-SLD with a pool CCI used to do DCD operations on the LDs. Extension of
> >      what Fan Ni is looking at. He's making an SLD pretend to be a device
> >      where DCD makes sense - whilst still using the CXL type 3 device.
> >      We probably shouldn't
> >      do that without figuring out how to do an MHD-SLD - or at least a head that we intend
> >      to hang this new stuff off - potentially just using the existing type 3 device with
> >      more parameters as one of the MH-SLD heads that doesn't have the control interface and
> >      different parameters if it does have the tunnel to the Pool CCI.
> >
>
> Personally I think we should focus on the DCD model. In fact, I think
> we're already very close to this, as my personal prototype showed this
> can work fairly cleanly, and I imagine I'll have a quick MHD patch set
> once I get the chance to review the DCD patch set.

Agreed. It's easier.

>
> If I'm being honest, the more I look at the LD model, the less I
> like it, but I understand that's how scale is going to be achieved. I
> don't know if focusing on that design right now is going to produce
> adoption in the short term, since we're not likely to see those devices
> for a few years.
>
> MH-SLD+DCD is likely to show up much sooner, so I will target that.

Yup. The switch case is interesting for driving Fabric Manager architecture
so I'd like to enable it at some point, but device wise MH-SLD+DCD is
probably going to come first. I might extend the switch CCI or the
MCTP CCI PoC enough to get comms up and query stuff, but focus will
be on a mailbox on the MHD - note we have to ensure there is only one of
those for the FM API controls driving DCD.

Jonathan

>
> ~Gregory
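The `./cxl_mhd_init` helper quoted above is only described in prose (set up the nr_heads/lds fields and a static ldmap where head_id == ld_id). A minimal sketch of what such a helper might do follows. The struct name `mhd_state`, the function names, and the size checks are assumptions for illustration, not the prototype's actual code; only the field names come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical layout, using the field names mentioned in the thread. */
struct mhd_state {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[];   /* ldmap[head] = LD id bound to that head */
};

/*
 * Fill in a static head-to-LD mapping (head_id == ld_id).
 * Returns 0 on success, -1 if the region cannot hold the map.
 */
int mhd_state_init(struct mhd_state *s, size_t region_size, uint8_t nr_heads)
{
    assert(s != NULL);
    if (nr_heads == 0 || region_size < sizeof(*s) + (size_t)nr_heads) {
        return -1;
    }
    s->nr_heads = nr_heads;
    s->nr_lds = nr_heads;
    for (uint8_t i = 0; i < nr_heads; i++) {
        s->ldmap[i] = i;
    }
    return 0;
}

/* Attach to an existing System V segment (e.g. one made by ipcmk -M). */
int mhd_init_shm(int shmid, uint8_t nr_heads)
{
    struct shmid_ds ds;
    void *p;
    int rc;

    if (shmctl(shmid, IPC_STAT, &ds) < 0) {
        return -1;
    }
    p = shmat(shmid, NULL, 0);
    if (p == (void *)-1) {
        return -1;
    }
    rc = mhd_state_init(p, ds.shm_segsz, nr_heads);
    shmdt(p);
    return rc;
}
```

With a layout like this, `./cxl_mhd_init 4 $shmid1` would reduce to parsing the two arguments and calling `mhd_init_shm(shmid, 4)` before any guest is launched.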
On Wed, May 17, 2023 at 03:18:59PM +0100, Jonathan Cameron wrote:
> >
> > i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> > MLD, and DCD all do (at least in theory).
>
> DCD 'might', though I don't think anything in the spec rules that you 'must'
> control the SLD/MLD via the FM-API, it's just a spec provided option.
> From our point of view we don't want to get more creative so let's assume
> it does.
>
> I can't immediately think of a reason for a single head SLD to have an FM owned
> LD, though it may well have an MCTP CCI for querying stuff about it from an FM.
>

Before I go running off into the woods, it seems like it would be simple
enough to make an FM-LD "device" which simply links an mhXXX device
and implements its own Mailbox CCI.

Maybe not "realistic", but to my mind this appears as a separate
character device in /dev/cxl/*. Maybe the realism here doesn't matter,
since we're just implementing for the sake of testing. This is just a
straightforward way to pipe a DCD request into the device and trigger
DCD event log entries.

As commented earlier, this is done as a QEMU fed event. If that's
sufficient, a hack like this feels like it would be at least mildly
cleaner and easier to test against.


Example: consider a user wanting to issue a DCD command to add capacity.

Real world: this would be some out of band communication, and eventually
this results in a DCD command to the device that results in a
capacity-event showing up in the log. Maybe it happens over TCP and
drills down to a Redfish event that talks to the BMC that issues a
command over etc etc MCTP emulations, etc.

With a simplistic /dev/cxl/memX-fmld device a user can simply issue these
commands without all that, and the effect is the same.

On the QEMU side you get something like:

-device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=mem0,mhd-main=true
-device cxl-mhsld,type3=mem0,bus=rp0,headid=0,id=mhsld1,shmid=XXXXX
-device cxl-fmld,mhsld=mhsld1,bus=rp1,id=mem0-fmld,shmid=YYYYY

on the Linux side you get:
/dev/cxl/mem0
/dev/cxl/mem0-fmld

in this example, the shmid for mhsld is a shared memory region created
with ipcmk that implements the shared state (basically section bitmap
tracking and the actual plumbing for DCD, etc). This limits the emulation
of the mhsld to a single host for now, but that seems sufficient.

The shmid for cxl-fmld implements any shared state for the fmld,
including a mutex, that allows all hosts attached to the mhsld to have
access to the fmld. This may or may not be realistic, but it would
allow all head-attached hosts to send DCD commands over their own local
fabric, rather than going out of band.

This gets us to the point where, at a minimum, each host can issue its
own DCD commands to add capacity to itself. That's step 1.

Step 2 is to allow Host A to issue a DCD command to add capacity to Host B.

I suppose this could be done via a background thread that waits on a
message to show up in the shared memory region?

Being somewhat unfamiliar with QEMU, is it kosher to start background
threads that just wait on events like this, or is that generally frowned
upon? If done this way, it would simplify the creation and startup
sequence at least.

~Gregory
On Mon, 29 May 2023 14:13:07 -0400 Gregory Price <gregory.price@memverge.com> wrote:

> On Wed, May 17, 2023 at 03:18:59PM +0100, Jonathan Cameron wrote:
> > >
> > > i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> > > MLD, and DCD all do (at least in theory).
> >
> > DCD 'might', though I don't think anything in the spec rules that you 'must'
> > control the SLD/MLD via the FM-API, it's just a spec provided option.
> > From our point of view we don't want to get more creative so let's assume
> > it does.
> >
> > I can't immediately think of a reason for a single head SLD to have an FM owned
> > LD, though it may well have an MCTP CCI for querying stuff about it from an FM.
> >

Sorry for slow reply - got distracted and forgot to cycle back to this.

>
> Before I go running off into the woods, it seems like it would be simple
> enough to simply make an FM-LD "device" which simply links a mhXXX device
> and implements its own Mailbox CCI.
>
> Maybe not "realistic", but to my mind this appears as a separate
> character device in /dev/cxl/*. Maybe the realism here doesn't matter,
> since we're just implementing for the sake of testing. This is just a
> straightforward way to pipe a DCD request into the device and trigger
> DCD event log entries.
>
> As commented earlier, this is done as a QEMU fed event. If that's
> sufficient, a hack like this feels like it would be at least mildly
> cleaner and easier to test against.

Or MCTP over I2C which works today, but needs more commands for this :)
I plan to look at the tunneling stuff shortly. Initially I'll punt the
guest using this to userspace, but potentially the eventual model
might well be to make it look like a bunch of direct attached CCIs from
the userspace point of view. I'm not 100% keen on pushing the management
of hotplug into the kernel though, as particular CCIs we are tunneling to
in a wider fabric may come and go.
For an MHD this would be easy, not so much for a switch CCI with
tunneling to MLDs and MH-MLDs below it.

>
> Example: consider a user wanting to issue a DCD command to add capacity.
>
> Real world: this would be some out of band communication, and eventually
> this results in a DCD command to the device that results in a
> capacity-event showing up in the log. Maybe it happens over TCP and
> drills down to a Redfish event that talks to the BMC that issues a
> command over etc etc MCTP emulations, etc.
>
> With a simplistic /dev/cxl/memX-fmld device a user can simply issue these
> commands without all that, and the effect is the same.

Yup - something along those lines makes sense.

>
> On the QEMU side you get something like:
>
> -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=mem0,mhd-main=true

I'd expect this device to present the mailbox commands for tunneling to the
FM-LD - as such I'd want a reference from here to your cxl-fmld below.

> -device cxl-mhsld,type3=mem0,bus=rp0,headid=0,id=mhsld1,shmid=XXXXX

Not sure why this is on the bus rp0.

> -device cxl-fmld,mhsld=mhsld1,bus=rp1,id=mem0-fmld,shmid=YYYYY

To be spec compliant that cxl-fmld still has to support normal use as well
as tunnelling to the FM owned LD - so it's a superset of a type3 device.

My gut feeling is keep it simple for a PoC / supporting enablement.

1 device on the host that is also serving as the FM. Probably just
an extended type 3 with some more options to turn this feature on.
1 device on each other host that connects via socket.

All devices share the same underlying memory.

Access bitmap is fiddly - either a push model over socket, or a shared
bitmap like you suggest. Either works, not sure which ends up cleaner.

It may well become more devices over time, but that should be driven by the
different types of CCI sharing common infrastructure rather than trying
to figure out that model at the start.
>
> on the Linux side you get:
> /dev/cxl/mem0
> /dev/cxl/mem0-fmld
>
> in this example, the shmid for mhsld is a shared memory region created
> with ipcmk that implements the shared state (basically section bitmap
> tracking and the actual plumbing for DCD, etc). This limits the emulation
> of the mhsld to a single host for now, but that seems sufficient.
>
> The shmid for cxl-fmld implements any shared state for the fmld,
> including a mutex, that allows all hosts attached to the mhsld to have
> access to the fmld. This may or may not be realistic, but it would
> allow all head-attached hosts to send DCD commands over their own local
> fabric, rather than going out of band.

Not keen on that part. I'd like to keep close to the spec intent and only
allow one host to access the FM-LD.

>
> This gets us to the point where, at a minimum, each host can issue its
> own DCD commands to add capacity to itself. That's step 1.

I don't agree with this one. I really don't want hosts to be able to do
that. They need to talk to one host that is acting as fabric manager -
that can then talk to the MHD to do the allocations.

>
> Step 2 is to allow Host A to issue a DCD command to add capacity to Host B.
>
> I suppose this could be done via a background thread that waits on a
> message to show up in the shared memory region?

The actual setup should be done via the single host with the FM, but there
is still a need to notify the other hosts. I'd be tempted to do that via
a socket rather than shared memory. Just keep the shared memory for the
access bitmap. Or drop that access bitmap entirely and rely on each
host keeping track of its own access permissions.

For testing purposes I don't have a problem with insisting the owner
of the FM-LD must be started first and closed last.
That ties the lifetime of that host to that of the device, but that isn't
too much of a problem given the lifetime differences we may want to test
probably sit at the FM software layer, not the emulation of the hardware.

>
> Being somewhat unfamiliar with QEMU, is it kosher to start background
> threads that just wait on events like this, or is that generally frowned
> upon? If done this way, it would simplify the creation and startup
> sequence at least.
>
> ~Gregory
diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 7b72345079..1a9f2708e1 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -356,16 +356,6 @@ typedef struct CXLPoison {
 typedef QLIST_HEAD(, CXLPoison) CXLPoisonList;
 #define CXL_POISON_LIST_LIMIT 256

+struct CXLMHDState {
+    uint8_t nr_heads;
+    uint8_t nr_lds;
+    uint8_t ldmap[];
+};
+
 struct CXLType3Dev {
     /* Private */
     PCIDevice parent_obj;
@@ -377,15 +367,6 @@ struct CXLType3Dev {
     HostMemoryBackend *lsa;
     uint64_t sn;
+
+    /* Multi-headed device settings */
+    struct {
+        bool active;
+        uint32_t headid;
+        uint32_t shmid;
+        struct CXLMHDState *state;
+    } mhd;
+

The way you would instantiate this would be via a separate process
that initializes the shared memory region:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`