
[rdma-core,1/2] verbs: Report the device's PCI write end padding capability

Message ID 1510667228-18579-2-git-send-email-yishaih@mellanox.com (mailing list archive)
State Accepted

Commit Message

Yishai Hadas Nov. 14, 2017, 1:47 p.m. UTC
From: Noa Osherovich <noaos@mellanox.com>

Some PCIe root complexes are able to optimize their performance when
incoming data consists of multiple full cache lines.
Expose a device capability that reports whether the device supports
padding the end of incoming packets out to a full cache line, such
that the last upstream write generated by an incoming packet is a
full cache line.

Users should consider several factors before activating this feature:
- Under high CPU memory load (which may in turn cause PCI
  backpressure), if a large percentage of the writes are partial cache
  lines, this feature should be considered as a possible optimization.
- This feature might reduce performance if most packets are between
  one and two cache lines and PCIe throughput has already reached its
  maximum capacity. E.g. a 65B packet from the network port leads to a
  128B write on PCIe, which may push PCIe traffic close to its maximum
  throughput.

Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
---
 libibverbs/examples/devinfo.c | 5 ++++-
 libibverbs/verbs.h            | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)
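
As a minimal sketch (not part of this patch), an application built against
a libibverbs that carries this change could detect the new capability
through ibv_query_device_ex(), much as the devinfo.c hunk below does;
device selection and error handling are kept trivial here:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **dev_list = ibv_get_device_list(NULL);
	struct ibv_device_attr_ex attr = {};
	struct ibv_context *ctx;

	if (!dev_list || !dev_list[0])
		return 1;

	/* Open the first device only, for brevity */
	ctx = ibv_open_device(dev_list[0]);
	if (!ctx)
		return 1;

	if (!ibv_query_device_ex(ctx, NULL, &attr) &&
	    (attr.device_cap_flags_ex & IBV_DEVICE_PCI_WRITE_END_PADDING))
		printf("PCI_WRITE_END_PADDING supported\n");

	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	return 0;
}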

Comments

Jason Gunthorpe Nov. 14, 2017, 8:30 p.m. UTC | #1
On Tue, Nov 14, 2017 at 03:47:07PM +0200, Yishai Hadas wrote:
> From: Noa Osherovich <noaos@mellanox.com>
> 
> Some PCIe root complexes are able to optimize their performance when
> incoming data consists of multiple full cache lines.
> Expose a device capability that reports whether the device supports
> padding the end of incoming packets out to a full cache line, such
> that the last upstream write generated by an incoming packet is a
> full cache line.
> 
> Users should consider several factors before activating this feature:
> - Under high CPU memory load (which may in turn cause PCI
>   backpressure), if a large percentage of the writes are partial cache
>   lines, this feature should be considered as a possible optimization.
> - This feature might reduce performance if most packets are between
>   one and two cache lines and PCIe throughput has already reached its
>   maximum capacity. E.g. a 65B packet from the network port leads to a
>   128B write on PCIe, which may push PCIe traffic close to its maximum
>   throughput.

This commit message would make a far better man page revision than
what was provided :(

Jason
Jason Gunthorpe Nov. 14, 2017, 8:33 p.m. UTC | #2
On Tue, Nov 14, 2017 at 03:47:07PM +0200, Yishai Hadas wrote:

>   * enum range is limited to 4 bytes.
>   */
>  #define IBV_DEVICE_RAW_SCATTER_FCS (1ULL << 34)
> +#define IBV_DEVICE_PCI_WRITE_END_PADDING (1ULL << 36)

Man page?

Jason
Yishai Hadas Nov. 15, 2017, 1:10 p.m. UTC | #3
On 11/14/2017 10:33 PM, Jason Gunthorpe wrote:
> On Tue, Nov 14, 2017 at 03:47:07PM +0200, Yishai Hadas wrote:
> 
>>    * enum range is limited to 4 bytes.
>>    */
>>   #define IBV_DEVICE_RAW_SCATTER_FCS (1ULL << 34)
>> +#define IBV_DEVICE_PCI_WRITE_END_PADDING (1ULL << 36)
> 
> Man page?
> 

The man page was updated with a detailed description of the above 
capability; see the PR:
https://github.com/linux-rdma/rdma-core/pull/250
Jason Gunthorpe Nov. 15, 2017, 3:15 p.m. UTC | #4
On Wed, Nov 15, 2017 at 03:10:30PM +0200, Yishai Hadas wrote:

> The man page was updated with a detailed description of the above
> capability; see the PR:
> https://github.com/linux-rdma/rdma-core/pull/250

Yah, that is nicer

Here is some copy-editing:

Extended device capability flags (device_cap_flags_ex):
.br
.TP 7
IBV_DEVICE_PCI_WRITE_END_PADDING

Indicates the device has support for padding PCI writes to a full cache line.

Padding packets to full cache lines reduces the amount of traffic required at
the memory controller at the expense of creating more traffic on the PCI-E
port.

Workloads that have a high CPU memory load and low PCI-E utilization will
benefit from this feature, while workloads that have a high PCI-E utilization
and small packets will be harmed.

For instance, with a 128 byte cache line size, the transfer of any packet
smaller than 128 bytes will require a full 128 byte transfer on PCI-E,
potentially doubling the required PCI-E bandwidth.

This feature can be enabled on a QP or WQ basis via the
IBV_QP_CREATE_PCI_WRITE_END_PADDING or IBV_WQ_FLAGS_PCI_WRITE_END_PADDING
flags.
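
As a rough sketch only (the QP and WQ creation flags come from the rest of
this series, not from this patch), enabling the padding on a raw packet QP
might look like this; ctx, pd and cq are assumed to already exist:

/* Sketch: assumes IBV_QP_CREATE_PCI_WRITE_END_PADDING is available */
static struct ibv_qp *create_padded_qp(struct ibv_context *ctx,
				       struct ibv_pd *pd, struct ibv_cq *cq)
{
	struct ibv_qp_init_attr_ex attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap = { .max_send_wr = 1, .max_recv_wr = 64,
			 .max_send_sge = 1, .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RAW_PACKET,
		.comp_mask = IBV_QP_INIT_ATTR_PD | IBV_QP_INIT_ATTR_CREATE_FLAGS,
		.pd = pd,
		.create_flags = IBV_QP_CREATE_PCI_WRITE_END_PADDING,
	};

	return ibv_create_qp_ex(ctx, &attr);
}

A WQ would be handled the same way, by setting
IBV_WQ_FLAGS_PCI_WRITE_END_PADDING in the create_flags of struct
ibv_wq_init_attr together with IBV_WQ_INIT_ATTR_FLAGS in its comp_mask.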
Yishai Hadas Nov. 15, 2017, 4:18 p.m. UTC | #5
On 11/15/2017 5:15 PM, Jason Gunthorpe wrote:
> On Wed, Nov 15, 2017 at 03:10:30PM +0200, Yishai Hadas wrote:
> 
>> The man page was updated with a detailed description on the above
>> capability, see PR:
>> https://github.com/linux-rdma/rdma-core/pull/250
> 
> Yah, that is nicer
> 
> Here is some copy-editing:
> 
> Extended device capability flags (device_cap_flags_ex):
> .br
> .TP 7
> IBV_DEVICE_PCI_WRITE_END_PADDING
> 
> Indicates the device has support for padding PCI writes to a full cache line.
> 
> Padding packets to full cache lines reduces the amount of traffic required at
> the memory controller at the expense of creating more traffic on the PCI-E
> port.
> 
> Workloads that have a high CPU memory load and low PCI-E utilization will
> benefit from this feature, while workloads that have a high PCI-E utilization
> and small packets will be harmed.
> 
> For instance, with a 128 byte cache line size, the transfer of any packet
> smaller than 128 bytes will require a full 128 byte transfer on PCI-E,
> potentially doubling the required PCI-E bandwidth.
> 
> This feature can be enabled on a QP or WQ basis via the
> IBV_QP_CREATE_PCI_WRITE_END_PADDING or IBV_WQ_FLAGS_PCI_WRITE_END_PADDING
> flags.
> 

OK, PR was updated accordingly.
https://github.com/linux-rdma/rdma-core/pull/250

Patch

diff --git a/libibverbs/examples/devinfo.c b/libibverbs/examples/devinfo.c
index 169da2e..d02952e 100644
--- a/libibverbs/examples/devinfo.c
+++ b/libibverbs/examples/devinfo.c
@@ -331,10 +331,13 @@  static void print_odp_caps(const struct ibv_odp_caps *caps)
 static void print_device_cap_flags_ex(uint64_t device_cap_flags_ex)
 {
 	uint64_t ex_flags = device_cap_flags_ex & 0xffffffff00000000ULL;
-	uint64_t unknown_flags = ~(IBV_DEVICE_RAW_SCATTER_FCS);
+	uint64_t unknown_flags = ~(IBV_DEVICE_RAW_SCATTER_FCS |
+				   IBV_DEVICE_PCI_WRITE_END_PADDING);
 
 	if (ex_flags & IBV_DEVICE_RAW_SCATTER_FCS)
 		printf("\t\t\t\t\tRAW_SCATTER_FCS\n");
+	if (ex_flags & IBV_DEVICE_PCI_WRITE_END_PADDING)
+		printf("\t\t\t\t\tPCI_WRITE_END_PADDING\n");
 	if (ex_flags & unknown_flags)
 		printf("\t\t\t\t\tUnknown flags: 0x%" PRIX64 "\n",
 		       ex_flags & unknown_flags);
diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
index 3e543cb..025e321 100644
--- a/libibverbs/verbs.h
+++ b/libibverbs/verbs.h
@@ -142,6 +142,7 @@  enum ibv_device_cap_flags {
  * enum range is limited to 4 bytes.
  */
 #define IBV_DEVICE_RAW_SCATTER_FCS (1ULL << 34)
+#define IBV_DEVICE_PCI_WRITE_END_PADDING (1ULL << 36)
 
 enum ibv_atomic_cap {
 	IBV_ATOMIC_NONE,