Message ID | 20241212113440.352958-19-dlemoal@kernel.org (mailing list archive)
State | Superseded
Series | NVMe PCI endpoint target driver
On Thu, Dec 12, 2024 at 08:34:40PM +0900, Damien Le Moal wrote:
> Add a documentation file
> (Documentation/nvme/nvme-pci-endpoint-target.rst) for the new NVMe PCI
> endpoint target driver. This provides an overview of the driver
> requirements, capabilities and limitations. A user guide describing how
> to setup a NVMe PCI endpoint device using this driver is also provided.
>
> This document is made accessible also from the PCI endpoint
> documentation using a link. Furthermore, since the existing nvme
> documentation was not accessible from the top documentation index, an
> index file is added to Documentation/nvme and this index listed as
> "NVMe Subsystem" in the "Storage interfaces" section of the subsystem
> API index.

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Applying: Documentation: Document the NVMe PCI endpoint target driver
.git/rebase-apply/patch:43: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

> +The NVMe PCI endpoint target driver allows exposing a NVMe target controller
> +over a PCIe link, thus implementing an NVMe PCIe device similar to a regular
> +M.2 SSD. The target controller is created in the same manner as when using NVMe
> +over fabrics: the controller represents the interface to an NVMe subsystem
> +using a port. The port transfer type must be configured to be "pci". The
> +subsystem can be configured to have namespaces backed by regular files or block
> +devices, or can use NVMe passthrough to expose an existing physical NVMe device
> +or a NVMe fabrics host controller (e.g. a NVMe TCP host controller).
> +
> +The NVMe PCI endpoint target driver relies as much as possible on the NVMe
> +target core code to parse and execute NVMe commands submitted by the PCI RC
> +host. However, using the PCI endpoint framework API and DMA API, the driver is
> +also responsible for managing all data transfers over the PCI link. This
> +implies that the NVMe PCI endpoint target driver implements several NVMe data
> +structure management and some command parsing.

Sort of a mix of "PCIe link" vs "PCI link", maybe make them consistent.

Does "PCI RC" mean "Root Complex"? If so, maybe "PCIe Root Complex" the
first time, and "PCIe RC" subsequently? I don't know enough about this to
know whether "Root Complex" is necessary in this context, or whether
"host" might be enough.

> +4) The boot partition support (BPS), Persistent Memory Region Supported (PMRS)
> +   and Controller Memory Buffer Supported (CMBS) capabilities are never reported.

Gratuitous >80 column line.

> +If the PCI endpoint controller used does not support MSIX, MSI can be
> +configured instead::

s/MSIX/MSI-X/ as is used elsewhere

> +The NVMe PCI endpoint target driver uses the PCI endpoint configfs device attributes as follows.

Gratuitous >80 column line.
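As a side note to the capability discussion quoted above, the CAP register
bits the text refers to (CQR, the doorbell stride, NSSR, BPS, PMRS, CMBS)
can be checked from the host once the endpoint is running, for example with
nvme-cli. A hedged example, assuming the endpoint enumerated on the host as
/dev/nvme0 as in the lspci/lsblk output later in the patch::

    # nvme show-regs /dev/nvme0 -H

This dumps the controller registers exposed through BAR 0, including CAP, in
human-readable form, which makes it easy to confirm which capabilities the
endpoint actually reports.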
On Thu, Dec 12, 2024 at 08:34:40PM +0900, Damien Le Moal wrote: > Add a documentation file > (Documentation/nvme/nvme-pci-endpoint-target.rst) for the new NVMe PCI > endpoint target driver. This provides an overview of the driver > requirements, capabilities and limitations. A user guide describing how > to setup a NVMe PCI endpoint device using this driver is also provided. > > This document is made accessible also from the PCI endpoint > documentation using a link. Furthermore, since the existing nvme > documentation was not accessible from the top documentation index, an > index file is added to Documentation/nvme and this index listed as > "NVMe Subsystem" in the "Storage interfaces" section of the subsystem > API index. > > Signed-off-by: Damien Le Moal <dlemoal@kernel.org> > Reviewed-by: Christoph Hellwig <hch@lst.de> > --- > Documentation/PCI/endpoint/index.rst | 1 + > .../PCI/endpoint/pci-nvme-function.rst | 14 + > Documentation/nvme/index.rst | 12 + > .../nvme/nvme-pci-endpoint-target.rst | 365 ++++++++++++++++++ > Documentation/subsystem-apis.rst | 1 + > 5 files changed, 393 insertions(+) > create mode 100644 Documentation/PCI/endpoint/pci-nvme-function.rst > create mode 100644 Documentation/nvme/index.rst > create mode 100644 Documentation/nvme/nvme-pci-endpoint-target.rst > > diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst > index 4d2333e7ae06..dd1f62e731c9 100644 > --- a/Documentation/PCI/endpoint/index.rst > +++ b/Documentation/PCI/endpoint/index.rst > @@ -15,6 +15,7 @@ PCI Endpoint Framework > pci-ntb-howto > pci-vntb-function > pci-vntb-howto > + pci-nvme-function > > function/binding/pci-test > function/binding/pci-ntb > diff --git a/Documentation/PCI/endpoint/pci-nvme-function.rst b/Documentation/PCI/endpoint/pci-nvme-function.rst > new file mode 100644 > index 000000000000..aedcfedf679b > --- /dev/null > +++ b/Documentation/PCI/endpoint/pci-nvme-function.rst > @@ -0,0 +1,14 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +================= > +PCI NVMe Function > +================= > + > +:Author: Damien Le Moal <dlemoal@kernel.org> > + > +The PCI NVMe endpoint function implements a PCI NVMe controller using the NVMe > +subsystem target core code. The driver for this function resides with the NVMe > +subsystem as drivers/nvme/target/nvmet-pciep.c. > + > +See Documentation/nvme/nvme-pci-endpoint-target.rst for more details. > + > diff --git a/Documentation/nvme/index.rst b/Documentation/nvme/index.rst > new file mode 100644 > index 000000000000..13383c760cc7 > --- /dev/null > +++ b/Documentation/nvme/index.rst > @@ -0,0 +1,12 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +============== > +NVMe Subsystem > +============== > + > +.. toctree:: > + :maxdepth: 2 > + :numbered: > + > + feature-and-quirk-policy > + nvme-pci-endpoint-target > diff --git a/Documentation/nvme/nvme-pci-endpoint-target.rst b/Documentation/nvme/nvme-pci-endpoint-target.rst > new file mode 100644 > index 000000000000..6a96f05daf01 > --- /dev/null > +++ b/Documentation/nvme/nvme-pci-endpoint-target.rst > @@ -0,0 +1,365 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +======================== > +NVMe PCI Endpoint Target > +======================== > + > +:Author: Damien Le Moal <dlemoal@kernel.org> > + > +The NVMe PCI endpoint target driver implements a PCIe NVMe controller using a > +NVMe fabrics target controller using the PCI transport type. 
> + > +Overview > +======== > + > +The NVMe PCI endpoint target driver allows exposing a NVMe target controller > +over a PCIe link, thus implementing an NVMe PCIe device similar to a regular > +M.2 SSD. The target controller is created in the same manner as when using NVMe > +over fabrics: the controller represents the interface to an NVMe subsystem > +using a port. The port transfer type must be configured to be "pci". The > +subsystem can be configured to have namespaces backed by regular files or block > +devices, or can use NVMe passthrough to expose an existing physical NVMe device > +or a NVMe fabrics host controller (e.g. a NVMe TCP host controller). > + > +The NVMe PCI endpoint target driver relies as much as possible on the NVMe > +target core code to parse and execute NVMe commands submitted by the PCI RC > +host. However, using the PCI endpoint framework API and DMA API, the driver is > +also responsible for managing all data transfers over the PCI link. This > +implies that the NVMe PCI endpoint target driver implements several NVMe data > +structure management and some command parsing. > + > +1) The driver manages retrieval of NVMe commands in submission queues using DMA > + if supported, or MMIO otherwise. Each command retrieved is then executed > + using a work item to maximize performance with the parallel execution of > + multiple commands on different CPUs. The driver uses a work item to > + constantly poll the doorbell of all submission queues to detect command > + submissions from the PCI RC host. > + > +2) The driver transfers completion queues entries of completed commands to the > + PCI RC host using MMIO copy of the entries in the host completion queue. > + After posting completion entries in a completion queue, the driver uses the > + PCI endpoint framework API to raise an interrupt to the host to signal the > + commands completion. > + > +3) For any command that has a data buffer, the NVMe PCI endpoint target driver > + parses the command PRPs or SGLs lists to create a list of PCI address > + segments representing the mapping of the command data buffer on the host. > + The command data buffer is transferred over the PCI link using this list of > + PCI address segments using DMA, if supported. If DMA is not supported, MMIO > + is used, which results in poor performance. For write commands, the command > + data buffer is transferred from the host into a local memory buffer before > + executing the command using the target core code. For read commands, a local > + memory buffer is allocated to execute the command and the content of that > + buffer is transferred to the host once the command completes. > + > +Controller Capabilities > +----------------------- > + > +The NVMe capabilities exposed to the PCI RC host through the BAR 0 registers > +are almost identical to the capabilities of the NVMe target controller > +implemented by the target core code. There are some exceptions. > + > +1) The NVMe PCI endpoint target driver always sets the controller capability > + CQR bit to request "Contiguous Queues Required". This is to facilitate the > + mapping of a queue PCI address range to the local CPU address space. > + > +2) The doorbell stride (DSTRB) is always set to be 4B > + > +3) Since the PCI endpoint framework does not provide a way to handle PCI level > + resets, the controller capability NSSR bit (NVM Subsystem Reset Supported) > + is always cleared. 
> + > +4) The boot partition support (BPS), Persistent Memory Region Supported (PMRS) > + and Controller Memory Buffer Supported (CMBS) capabilities are never reported. > + > +Supported Features > +------------------ > + > +The NVMe PCI endpoint target driver implements support for both PRPs and SGLs. > +The driver also implements IRQ vector coalescing and submission queue > +arbitration burst. > + > +The maximum number of queues and the maximum data transfer size (MDTS) are > +configurable through configfs before starting the controller. To avoid issues > +with excessive local memory usage for executing commands, MDTS defaults to 512 > +KB and is limited to a maximum of 2 MB (arbitrary limit). > + > +Mimimum number of PCI Address Mapping Windows Required > +------------------------------------------------------ > + > +Most PCI endpoint controllers provide a limited number of mapping windows for > +mapping a PCI address range to local CPU memory addresses. The NVMe PCI > +endpoint target controllers uses mapping windows for the following. > + > +1) One memory window for raising MSI or MSI-X interrupts > +2) One memory window for MMIO transfers > +3) One memory window for each completion queue > + > +Given the highly asynchronous nature of the NVMe PCI endpoint target driver > +operation, the memory windows as described above will generally not be used > +simultaneously, but that may happen. So a safe maximum number of completion > +queues that can be supported is equal to the total number of memory mapping > +windows of the PCI endpoint controller minus two. E.g. for an endpoint PCI > +controller with 32 outbound memory windows available, up to 30 completion > +queues can be safely operated without any risk of getting PCI address mapping > +errors due to the lack of memory windows. > + > +Maximum Number of Queue Pairs > +----------------------------- > + > +Upon binding of the NVMe PCI endpoint target driver to the PCI endpoint > +controller, BAR 0 is allocated with enough space to accommodate the admin queue > +and multiple I/O queues. The maximum of number of I/O queues pairs that can be > +supported is limited by several factors. > + > +1) The NVMe target core code limits the maximum number of I/O queues to the > + number of online CPUs. > +2) The total number of queue pairs, including the admin queue, cannot exceed > + the number of MSI-X or MSI vectors available. > +3) The total number of completion queues must not exceed the total number of > + PCI mapping windows minus 2 (see above). > + > +The NVMe endpoint function driver allows configuring the maximum number of > +queue pairs through configfs. > + > +Limitations and NVMe Specification Non-Compliance > +------------------------------------------------- > + > +Similar to the NVMe target core code, the NVMe PCI endpoint target driver does > +not support multiple submission queues using the same completion queue. All > +submission queues must specify a unique completion queue. > + > + > +User Guide > +========== > + > +This section describes the hardware requirements and how to setup an NVMe PCI > +endpoint target device. > + > +Kernel Requirements > +------------------- > + > +The kernel must be compiled with the configuration options CONFIG_PCI_ENDPOINT, > +CONFIG_PCI_ENDPOINT_CONFIGFS, and CONFIG_NVME_TARGET_PCI_EP enabled. > +CONFIG_PCI, CONFIG_BLK_DEV_NVME and CONFIG_NVME_TARGET must also be enabled > +(obviously). 
> + > +In addition to this, at least one PCI endpoint controller driver should be > +available for the endpoint hardware used. > + > +To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK) > +is also recommended. With this, a simple setup using a null_blk block device > +as a subsystem namespace can be used. > + > +Hardware Requirements > +--------------------- > + > +To use the NVMe PCI endpoint target driver, at least one endpoint controller > +device is required. > + > +To find the list of endpoint controller devices in the system:: > + > + # ls /sys/class/pci_epc/ > + a40000000.pcie-ep > + > +If PCI_ENDPOINT_CONFIGFS is enabled:: > + > + # ls /sys/kernel/config/pci_ep/controllers > + a40000000.pcie-ep > + > +The endpoint board must of course also be connected to a host with a PCI cable > +with RX-TX signal swapped. If the host PCI slot used does not have > +plug-and-play capabilities, the host should be powered off when the NVMe PCI > +endpoint device is configured. > + > +NVMe Endpoint Device > +-------------------- > + > +Creating an NVMe endpoint device is a two step process. First, an NVMe target > +subsystem and port must be defined. Second, the NVMe PCI endpoint device must be > +setup and bound to the subsystem and port created. > + > +Creating a NVMe Subsystem and Port > +---------------------------------- > + > +Details about how to configure a NVMe target subsystem and port are outside the > +scope of this document. The following only provides a simple example of a port > +and subsystem with a single namespace backed by a null_blk device. > + > +First, make sure that configfs is enabled:: > + > + # mount -t configfs none /sys/kernel/config > + > +Next, create a null_blk device (default settings give a 250 GB device without > +memory backing). The block device created will be /dev/nullb0 by default:: > + > + # modprobe null_blk > + # ls /dev/nullb0 > + /dev/nullb0 > + > +The NVMe target core driver must be loaded:: > + > + # modprobe nvmet > + # lsmod | grep nvmet > + nvmet 118784 0 > + nvme_core 131072 1 nvmet > + > +Now, create a subsystem and a port that we will use to create a PCI target > +controller when setting up the NVMe PCI endpoint target device. In this > +example, the port is created with a maximum of 4 I/O queue pairs:: > + > + # cd /sys/kernel/config/nvmet/subsystems > + # mkdir nvmepf.0.nqn > + # echo -n "Linux-nvmet-pciep" > nvmepf.0.nqn/attr_model > + # echo "0x1b96" > nvmepf.0.nqn/attr_vendor_id > + # echo "0x1b96" > nvmepf.0.nqn/attr_subsys_vendor_id > + # echo 1 > nvmepf.0.nqn/attr_allow_any_host > + # echo 4 > nvmepf.0.nqn/attr_qid_max > + > +Next, create and enable the subsystem namespace using the null_blk block device:: > + > + # mkdir nvmepf.0.nqn/namespaces/1 > + # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path > + # echo 1 > "pci_epf_nvme.0.nqn/namespaces/1/enable" I have to do, 'echo 1 > nvmepf.0.nqn/namespaces/1/enable' > + > +Finally, create the target port and link it to the subsystem:: > + > + # cd /sys/kernel/config/nvmet/ports > + # mkdir 1 > + # echo -n "pci" > 1/addr_trtype > + # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \ > + /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn > + > +Creating a NVMe PCI Endpoint Device > +----------------------------------- > + > +With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint > +device can now be created and enabled. 
The NVMe PCI endpoint target driver > +should already be loaded (that is done automatically when the port is created):: > + > + # ls /sys/kernel/config/pci_ep/functions > + nvmet_pciep > + > +Next, create function 0:: > + > + # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep > + # mkdir nvmepf.0 > + # ls nvmepf.0/ > + baseclass_code msix_interrupts secondary > + cache_line_size nvme subclass_code > + deviceid primary subsys_id > + interrupt_pin progif_code subsys_vendor_id > + msi_interrupts revid vendorid > + > +Configure the function using any vendor ID and device ID:: > + > + # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep > + # echo 0x1b96 > nvmepf.0/vendorid > + # echo 0xBEEF > nvmepf.0/deviceid > + # echo 32 > nvmepf.0/msix_interrupts > + > +If the PCI endpoint controller used does not support MSIX, MSI can be > +configured instead:: > + > + # echo 32 > nvmepf.0/msi_interrupts > + > +Next, let's bind our endpoint device with the target subsystem and port that we > +created:: > + > + # echo 1 > nvmepf.0/portid 'echo 1 > nvmepf.0/nvme/portid' > + # echo "nvmepf.0.nqn" > nvmepf.0/subsysnqn 'echo 1 > nvmepf.0/nvme/subsysnqn' > + > +The endpoint function can then be bound to the endpoint controller and the > +controller started:: > + > + # cd /sys/kernel/config/pci_ep > + # ln -s functions/nvmet_pciep/nvmepf.0 controllers/a40000000.pcie-ep/ > + # echo 1 > controllers/a40000000.pcie-ep/start > + > +On the endpoint machine, kernel messages will show information as the NVMe > +target device and endpoint device are created and connected. > + For some reason, I cannot get the function driver working. Getting this warning on the ep: nvmet: connect request for invalid subsystem 1! I didn't debug it further. Will do it tomorrow morning and let you know. - Mani
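Pulling together the path corrections pointed out in the review above (and
consistent with the *nvme* subdirectory shown in the "ls nvmepf.0/" listing
in the patch), the affected commands would read as follows, with full
configfs paths. A sketch only, assuming the same example names::

    # echo 1 > /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn/namespaces/1/enable
    # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep
    # echo 1 > nvmepf.0/nvme/portid
    # echo "nvmepf.0.nqn" > nvmepf.0/nvme/subsysnqn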
On 2024/12/17 9:30, Manivannan Sadhasivam wrote: >> +Now, create a subsystem and a port that we will use to create a PCI target >> +controller when setting up the NVMe PCI endpoint target device. In this >> +example, the port is created with a maximum of 4 I/O queue pairs:: >> + >> + # cd /sys/kernel/config/nvmet/subsystems >> + # mkdir nvmepf.0.nqn >> + # echo -n "Linux-nvmet-pciep" > nvmepf.0.nqn/attr_model >> + # echo "0x1b96" > nvmepf.0.nqn/attr_vendor_id >> + # echo "0x1b96" > nvmepf.0.nqn/attr_subsys_vendor_id >> + # echo 1 > nvmepf.0.nqn/attr_allow_any_host >> + # echo 4 > nvmepf.0.nqn/attr_qid_max >> + >> +Next, create and enable the subsystem namespace using the null_blk block device:: >> + >> + # mkdir nvmepf.0.nqn/namespaces/1 >> + # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path >> + # echo 1 > "pci_epf_nvme.0.nqn/namespaces/1/enable" > > I have to do, 'echo 1 > nvmepf.0.nqn/namespaces/1/enable' Good catch. That is the old name from previous version. Will fix this. >> + >> +Finally, create the target port and link it to the subsystem:: >> + >> + # cd /sys/kernel/config/nvmet/ports >> + # mkdir 1 >> + # echo -n "pci" > 1/addr_trtype >> + # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \ >> + /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn >> + >> +Creating a NVMe PCI Endpoint Device >> +----------------------------------- >> + >> +With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint >> +device can now be created and enabled. The NVMe PCI endpoint target driver >> +should already be loaded (that is done automatically when the port is created):: >> + >> + # ls /sys/kernel/config/pci_ep/functions >> + nvmet_pciep >> + >> +Next, create function 0:: >> + >> + # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep >> + # mkdir nvmepf.0 >> + # ls nvmepf.0/ >> + baseclass_code msix_interrupts secondary >> + cache_line_size nvme subclass_code >> + deviceid primary subsys_id >> + interrupt_pin progif_code subsys_vendor_id >> + msi_interrupts revid vendorid >> + >> +Configure the function using any vendor ID and device ID:: >> + >> + # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep >> + # echo 0x1b96 > nvmepf.0/vendorid >> + # echo 0xBEEF > nvmepf.0/deviceid >> + # echo 32 > nvmepf.0/msix_interrupts >> + >> +If the PCI endpoint controller used does not support MSIX, MSI can be >> +configured instead:: >> + >> + # echo 32 > nvmepf.0/msi_interrupts >> + >> +Next, let's bind our endpoint device with the target subsystem and port that we >> +created:: >> + >> + # echo 1 > nvmepf.0/portid > > 'echo 1 > nvmepf.0/nvme/portid' > >> + # echo "nvmepf.0.nqn" > nvmepf.0/subsysnqn > > 'echo 1 > nvmepf.0/nvme/subsysnqn' Yep. Good catch. > >> + >> +The endpoint function can then be bound to the endpoint controller and the >> +controller started:: >> + >> + # cd /sys/kernel/config/pci_ep >> + # ln -s functions/nvmet_pciep/nvmepf.0 controllers/a40000000.pcie-ep/ >> + # echo 1 > controllers/a40000000.pcie-ep/start >> + >> +On the endpoint machine, kernel messages will show information as the NVMe >> +target device and endpoint device are created and connected. >> + > > For some reason, I cannot get the function driver working. Getting this warning > on the ep: > > nvmet: connect request for invalid subsystem 1! > > I didn't debug it further. Will do it tomorrow morning and let you know. Hmmm... Weird. You should not ever see a connect request/command at all. 
Can you try this script:

https://github.com/damien-lemoal/buildroot/blob/rock5b_ep_v25/board/radxa/rock5b-ep/overlay/root/pci-ep/nvmet-pciep

Just run "./nvmet-pciep start" after booting the endpoint board. The command
example in the documentation is an extract from what this script does.

I think that the missing:

echo 1 > ${SUBSYSNQN}/attr_allow_any_host

may be the reason for this error.
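For anyone reproducing this without the script, the subsystem-side steps from
the documentation, with the attr_allow_any_host write included, would look as
follows. A condensed sketch using the same null_blk example as the patch::

    # cd /sys/kernel/config/nvmet/subsystems
    # mkdir nvmepf.0.nqn
    # echo 1 > nvmepf.0.nqn/attr_allow_any_host
    # echo 4 > nvmepf.0.nqn/attr_qid_max
    # mkdir nvmepf.0.nqn/namespaces/1
    # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path
    # echo 1 > nvmepf.0.nqn/namespaces/1/enable
    # cd /sys/kernel/config/nvmet/ports
    # mkdir 1
    # echo -n "pci" > 1/addr_trtype
    # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \
            /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn

Setting attr_allow_any_host permits controller creation without adding
explicit host NQN entries under the subsystem's allowed_hosts directory.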
diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst index 4d2333e7ae06..dd1f62e731c9 100644 --- a/Documentation/PCI/endpoint/index.rst +++ b/Documentation/PCI/endpoint/index.rst @@ -15,6 +15,7 @@ PCI Endpoint Framework pci-ntb-howto pci-vntb-function pci-vntb-howto + pci-nvme-function function/binding/pci-test function/binding/pci-ntb diff --git a/Documentation/PCI/endpoint/pci-nvme-function.rst b/Documentation/PCI/endpoint/pci-nvme-function.rst new file mode 100644 index 000000000000..aedcfedf679b --- /dev/null +++ b/Documentation/PCI/endpoint/pci-nvme-function.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +PCI NVMe Function +================= + +:Author: Damien Le Moal <dlemoal@kernel.org> + +The PCI NVMe endpoint function implements a PCI NVMe controller using the NVMe +subsystem target core code. The driver for this function resides with the NVMe +subsystem as drivers/nvme/target/nvmet-pciep.c. + +See Documentation/nvme/nvme-pci-endpoint-target.rst for more details. + diff --git a/Documentation/nvme/index.rst b/Documentation/nvme/index.rst new file mode 100644 index 000000000000..13383c760cc7 --- /dev/null +++ b/Documentation/nvme/index.rst @@ -0,0 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +NVMe Subsystem +============== + +.. toctree:: + :maxdepth: 2 + :numbered: + + feature-and-quirk-policy + nvme-pci-endpoint-target diff --git a/Documentation/nvme/nvme-pci-endpoint-target.rst b/Documentation/nvme/nvme-pci-endpoint-target.rst new file mode 100644 index 000000000000..6a96f05daf01 --- /dev/null +++ b/Documentation/nvme/nvme-pci-endpoint-target.rst @@ -0,0 +1,365 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +NVMe PCI Endpoint Target +======================== + +:Author: Damien Le Moal <dlemoal@kernel.org> + +The NVMe PCI endpoint target driver implements a PCIe NVMe controller using a +NVMe fabrics target controller using the PCI transport type. + +Overview +======== + +The NVMe PCI endpoint target driver allows exposing a NVMe target controller +over a PCIe link, thus implementing an NVMe PCIe device similar to a regular +M.2 SSD. The target controller is created in the same manner as when using NVMe +over fabrics: the controller represents the interface to an NVMe subsystem +using a port. The port transfer type must be configured to be "pci". The +subsystem can be configured to have namespaces backed by regular files or block +devices, or can use NVMe passthrough to expose an existing physical NVMe device +or a NVMe fabrics host controller (e.g. a NVMe TCP host controller). + +The NVMe PCI endpoint target driver relies as much as possible on the NVMe +target core code to parse and execute NVMe commands submitted by the PCI RC +host. However, using the PCI endpoint framework API and DMA API, the driver is +also responsible for managing all data transfers over the PCI link. This +implies that the NVMe PCI endpoint target driver implements several NVMe data +structure management and some command parsing. + +1) The driver manages retrieval of NVMe commands in submission queues using DMA + if supported, or MMIO otherwise. Each command retrieved is then executed + using a work item to maximize performance with the parallel execution of + multiple commands on different CPUs. The driver uses a work item to + constantly poll the doorbell of all submission queues to detect command + submissions from the PCI RC host. 
+ +2) The driver transfers completion queues entries of completed commands to the + PCI RC host using MMIO copy of the entries in the host completion queue. + After posting completion entries in a completion queue, the driver uses the + PCI endpoint framework API to raise an interrupt to the host to signal the + commands completion. + +3) For any command that has a data buffer, the NVMe PCI endpoint target driver + parses the command PRPs or SGLs lists to create a list of PCI address + segments representing the mapping of the command data buffer on the host. + The command data buffer is transferred over the PCI link using this list of + PCI address segments using DMA, if supported. If DMA is not supported, MMIO + is used, which results in poor performance. For write commands, the command + data buffer is transferred from the host into a local memory buffer before + executing the command using the target core code. For read commands, a local + memory buffer is allocated to execute the command and the content of that + buffer is transferred to the host once the command completes. + +Controller Capabilities +----------------------- + +The NVMe capabilities exposed to the PCI RC host through the BAR 0 registers +are almost identical to the capabilities of the NVMe target controller +implemented by the target core code. There are some exceptions. + +1) The NVMe PCI endpoint target driver always sets the controller capability + CQR bit to request "Contiguous Queues Required". This is to facilitate the + mapping of a queue PCI address range to the local CPU address space. + +2) The doorbell stride (DSTRB) is always set to be 4B + +3) Since the PCI endpoint framework does not provide a way to handle PCI level + resets, the controller capability NSSR bit (NVM Subsystem Reset Supported) + is always cleared. + +4) The boot partition support (BPS), Persistent Memory Region Supported (PMRS) + and Controller Memory Buffer Supported (CMBS) capabilities are never reported. + +Supported Features +------------------ + +The NVMe PCI endpoint target driver implements support for both PRPs and SGLs. +The driver also implements IRQ vector coalescing and submission queue +arbitration burst. + +The maximum number of queues and the maximum data transfer size (MDTS) are +configurable through configfs before starting the controller. To avoid issues +with excessive local memory usage for executing commands, MDTS defaults to 512 +KB and is limited to a maximum of 2 MB (arbitrary limit). + +Mimimum number of PCI Address Mapping Windows Required +------------------------------------------------------ + +Most PCI endpoint controllers provide a limited number of mapping windows for +mapping a PCI address range to local CPU memory addresses. The NVMe PCI +endpoint target controllers uses mapping windows for the following. + +1) One memory window for raising MSI or MSI-X interrupts +2) One memory window for MMIO transfers +3) One memory window for each completion queue + +Given the highly asynchronous nature of the NVMe PCI endpoint target driver +operation, the memory windows as described above will generally not be used +simultaneously, but that may happen. So a safe maximum number of completion +queues that can be supported is equal to the total number of memory mapping +windows of the PCI endpoint controller minus two. E.g. 
for an endpoint PCI +controller with 32 outbound memory windows available, up to 30 completion +queues can be safely operated without any risk of getting PCI address mapping +errors due to the lack of memory windows. + +Maximum Number of Queue Pairs +----------------------------- + +Upon binding of the NVMe PCI endpoint target driver to the PCI endpoint +controller, BAR 0 is allocated with enough space to accommodate the admin queue +and multiple I/O queues. The maximum of number of I/O queues pairs that can be +supported is limited by several factors. + +1) The NVMe target core code limits the maximum number of I/O queues to the + number of online CPUs. +2) The total number of queue pairs, including the admin queue, cannot exceed + the number of MSI-X or MSI vectors available. +3) The total number of completion queues must not exceed the total number of + PCI mapping windows minus 2 (see above). + +The NVMe endpoint function driver allows configuring the maximum number of +queue pairs through configfs. + +Limitations and NVMe Specification Non-Compliance +------------------------------------------------- + +Similar to the NVMe target core code, the NVMe PCI endpoint target driver does +not support multiple submission queues using the same completion queue. All +submission queues must specify a unique completion queue. + + +User Guide +========== + +This section describes the hardware requirements and how to setup an NVMe PCI +endpoint target device. + +Kernel Requirements +------------------- + +The kernel must be compiled with the configuration options CONFIG_PCI_ENDPOINT, +CONFIG_PCI_ENDPOINT_CONFIGFS, and CONFIG_NVME_TARGET_PCI_EP enabled. +CONFIG_PCI, CONFIG_BLK_DEV_NVME and CONFIG_NVME_TARGET must also be enabled +(obviously). + +In addition to this, at least one PCI endpoint controller driver should be +available for the endpoint hardware used. + +To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK) +is also recommended. With this, a simple setup using a null_blk block device +as a subsystem namespace can be used. + +Hardware Requirements +--------------------- + +To use the NVMe PCI endpoint target driver, at least one endpoint controller +device is required. + +To find the list of endpoint controller devices in the system:: + + # ls /sys/class/pci_epc/ + a40000000.pcie-ep + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/controllers + a40000000.pcie-ep + +The endpoint board must of course also be connected to a host with a PCI cable +with RX-TX signal swapped. If the host PCI slot used does not have +plug-and-play capabilities, the host should be powered off when the NVMe PCI +endpoint device is configured. + +NVMe Endpoint Device +-------------------- + +Creating an NVMe endpoint device is a two step process. First, an NVMe target +subsystem and port must be defined. Second, the NVMe PCI endpoint device must be +setup and bound to the subsystem and port created. + +Creating a NVMe Subsystem and Port +---------------------------------- + +Details about how to configure a NVMe target subsystem and port are outside the +scope of this document. The following only provides a simple example of a port +and subsystem with a single namespace backed by a null_blk device. + +First, make sure that configfs is enabled:: + + # mount -t configfs none /sys/kernel/config + +Next, create a null_blk device (default settings give a 250 GB device without +memory backing). 
The block device created will be /dev/nullb0 by default:: + + # modprobe null_blk + # ls /dev/nullb0 + /dev/nullb0 + +The NVMe target core driver must be loaded:: + + # modprobe nvmet + # lsmod | grep nvmet + nvmet 118784 0 + nvme_core 131072 1 nvmet + +Now, create a subsystem and a port that we will use to create a PCI target +controller when setting up the NVMe PCI endpoint target device. In this +example, the port is created with a maximum of 4 I/O queue pairs:: + + # cd /sys/kernel/config/nvmet/subsystems + # mkdir nvmepf.0.nqn + # echo -n "Linux-nvmet-pciep" > nvmepf.0.nqn/attr_model + # echo "0x1b96" > nvmepf.0.nqn/attr_vendor_id + # echo "0x1b96" > nvmepf.0.nqn/attr_subsys_vendor_id + # echo 1 > nvmepf.0.nqn/attr_allow_any_host + # echo 4 > nvmepf.0.nqn/attr_qid_max + +Next, create and enable the subsystem namespace using the null_blk block device:: + + # mkdir nvmepf.0.nqn/namespaces/1 + # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path + # echo 1 > "pci_epf_nvme.0.nqn/namespaces/1/enable" + +Finally, create the target port and link it to the subsystem:: + + # cd /sys/kernel/config/nvmet/ports + # mkdir 1 + # echo -n "pci" > 1/addr_trtype + # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \ + /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn + +Creating a NVMe PCI Endpoint Device +----------------------------------- + +With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint +device can now be created and enabled. The NVMe PCI endpoint target driver +should already be loaded (that is done automatically when the port is created):: + + # ls /sys/kernel/config/pci_ep/functions + nvmet_pciep + +Next, create function 0:: + + # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep + # mkdir nvmepf.0 + # ls nvmepf.0/ + baseclass_code msix_interrupts secondary + cache_line_size nvme subclass_code + deviceid primary subsys_id + interrupt_pin progif_code subsys_vendor_id + msi_interrupts revid vendorid + +Configure the function using any vendor ID and device ID:: + + # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep + # echo 0x1b96 > nvmepf.0/vendorid + # echo 0xBEEF > nvmepf.0/deviceid + # echo 32 > nvmepf.0/msix_interrupts + +If the PCI endpoint controller used does not support MSIX, MSI can be +configured instead:: + + # echo 32 > nvmepf.0/msi_interrupts + +Next, let's bind our endpoint device with the target subsystem and port that we +created:: + + # echo 1 > nvmepf.0/portid + # echo "nvmepf.0.nqn" > nvmepf.0/subsysnqn + +The endpoint function can then be bound to the endpoint controller and the +controller started:: + + # cd /sys/kernel/config/pci_ep + # ln -s functions/nvmet_pciep/nvmepf.0 controllers/a40000000.pcie-ep/ + # echo 1 > controllers/a40000000.pcie-ep/start + +On the endpoint machine, kernel messages will show information as the NVMe +target device and endpoint device are created and connected. + +.. code-block:: text + + null_blk: disk nullb0 created + null_blk: module loaded + nvmet: adding nsid 1 to subsystem nvmepf.0.nqn + nvmet_pciep nvmet_pciep.0: PCI endpoint controller supports MSI-X, 32 vectors + nvmet: Created nvm controller 1 for subsystem nvmepf.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:f82a09b7-9e14-4f77-903f-d0491e23611f. + nvmet_pciep nvmet_pciep.0: New PCI ctrl "nvmepf.0.nqn", 4 I/O queues, mdts 524288 B + +PCI Root-Complex Host +--------------------- + +Booting the PCI host will result in the initialization of the PCI link. 
This +will be signaled by the NVMe PCI endpoint target driver with a kernel message:: + + nvmet_pciep nvmet_pciep.0: PCI link up + +A kernel message on the endpoint will also signal when the host NVMe driver +enables the device controller:: + + nvmet_pciep nvmet_pciep.0: Enabling controller + +On the host side, the NVMe PCI endpoint target device will is discoverable +as a PCI device, with the vendor ID and device ID as configured:: + + # lspci -n + 0000:01:00.0 0108: 1b96:beef + +An this device will be recognized as an NVMe device with a single namespace:: + + # lsblk + NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS + nvme0n1 259:0 0 250G 0 disk + +The NVMe endpoint block device can then be used as any other regular NVMe +device. For instance, the nvme command line utility can be used to get more +detailed information about the endpoint device:: + + # nvme id-ctrl /dev/nvme0 + NVME Identify Controller: + vid : 0x1b96 + ssvid : 0x1b96 + sn : 94993c85650ef7bcd625 + mn : Linux-nvmet-pciep + fr : 6.13.0-r + rab : 6 + ieee : 000000 + cmic : 0xb + mdts : 7 + cntlid : 0x1 + ver : 0x20100 + ... + + +Endpoint Bindings +================= + +The NVMe PCI endpoint target driver uses the PCI endpoint configfs device attributes as follows. + +================ =========================================================== +vendorid Anything is OK (e.g. PCI_ANY_ID) +deviceid Anything is OK (e.g. PCI_ANY_ID) +revid Do not care +progif_code Must be 0x02 (NVM Express) +baseclass_code Must be 0x01 (PCI_BASE_CLASS_STORAGE) +subclass_code Must be 0x08 (Non-Volatile Memory controller) +cache_line_size Do not care +subsys_vendor_id Anything is OK (e.g. PCI_ANY_ID) +subsys_id Anything is OK (e.g. PCI_ANY_ID) +msi_interrupts At least equal to the number of queue pairs desired +msix_interrupts At least equal to the number of queue pairs desired +interrupt_pin Interrupt PIN to use if MSI and MSI-X are not supported +================ =========================================================== + +The NVMe PCI endpoint target function also has some specific configurable +fields defined in the *nvme* subdirectory of the function directory. These +fields are as follows. + +================ =========================================================== +dma_enable Enable (1) or disable (0) DMA transfers (default: 1) +mdts_kb Maximum data transfer size in KiB (default: 512) +portid The ID of the target port to use +subsysnqn The NQN of the target subsystem to use +================ =========================================================== diff --git a/Documentation/subsystem-apis.rst b/Documentation/subsystem-apis.rst index 74af50d2ef7f..b52ad5b969d4 100644 --- a/Documentation/subsystem-apis.rst +++ b/Documentation/subsystem-apis.rst @@ -60,6 +60,7 @@ Storage interfaces cdrom/index scsi/index target/index + nvme/index Other subsystems ----------------
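As a usage note for the bindings listed above, the attributes in the *nvme*
subdirectory are set on the function before it is linked to the endpoint
controller and started. A hedged example with illustrative values, taking the
defaults described earlier (DMA enabled, 512 KiB MDTS, 2 MiB maximum) as the
reference point::

    # cd /sys/kernel/config/pci_ep/functions/nvmet_pciep
    # echo 1024 > nvmepf.0/nvme/mdts_kb    # raise MDTS from the 512 KiB default (2048 KiB max)
    # echo 0 > nvmepf.0/nvme/dma_enable    # fall back to MMIO transfers (slower; DMA is the default)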