mbox series

[RFC,v12,00/17] dlb: introduce DLB device driver

Message ID 20211221065047.290182-1-mike.ximing.chen@intel.com (mailing list archive)
Headers show
Series dlb: introduce DLB device driver | expand

Message

Chen, Mike Ximing Dec. 21, 2021, 6:50 a.m. UTC
The Intel(r) Dynamic Load Balancer (Intel(r) DLB) is a PCIe device and hardware
accelerator that provides load-balanced, prioritized scheduling for event based
workloads across CPU cores. It can be used to save CPU resources in high
throughput pipelines by replacing software based distribution and synchronization
schemes. Using Intel DLB has been demonstrated to reduce CPU utilization up to
2-3 CPU cores per DLB device, while also improving packet processing pipeline
performance.

In many applications running on processors with a large number of cores,
workloads must be distributed (load-balanced) across a number of cores.
In packet processing applications, for example, streams of incoming packets can
exceed the capacity of any single core. So they have to be divided between
available worker cores. The workload can be split by either breaking the
processing flow into stages and places distinct stages on separate cores in a
daisy chain fashion (a pipeline), or spraying packets across multiple workers
that may be executing the same processing stage. Many systems employ a hybrid
approach whereby each packet encounters multiple pipelined stages with
distribution across multiple workers at each individual stage.

Intel DLB distribution schemes include "parallel" (packets are load-balanced
across multiple cores and processed in parallel), "ordered" (similar to
"parallel" but packets are reordered into ingress order by the device), and
"atomic" (packet flows are scheduled to a single core at a time such that
locks are not required to access per-flow data, and dynamically migrated to
ensure load-balance).

Intel DLB consists of queues and arbiters that connect producer
cores and consumer cores. The device implements load-balanced queueing
features including:
- Lock-free multi-producer/multi-consumer operation.
- Multiple priority levels for varying traffic types.
- 'Direct' traffic (i.e. multi-producer/single-consumer)
- Simple unordered load-balanced distribution.
- Atomic lock free load balancing across multiple consumers.
- Queue element reordering feature allowing ordered load-balanced
  distribution.

The fundamental unit of communication through the device is a queue element
(QE), which consists of 8B of data and 8B of metadata (destination queue,
priority, etc.). The data field can be any type that fits within 8B.

A core's interface to the device, a "port," consists of a memory-mappable
region through which the core enqueues a queue entry, and an in-memory
queue (the "consumer queue") to which the device schedules QEs. Each QE
is enqueued to a device-managed queue, and from there scheduled to a port.
Software specifies the "linking" of queues and ports; i.e. which ports the
device is allowed to schedule to for a given queue. The device uses a
credit scheme to prevent overflow of the on-device queue storage.

Applications can interface directly with the device by mapping the port's
memory and MMIO regions into the application's address space for enqueue
and dequeue operations, but call into the kernel driver for configuration
operations. An application can also be polling- or interrupt-driven;
Intel DLB supports both modes of operation.

Device resources -- i.e. ports, queues, and credits -- are contained within
a scheduling domain. Scheduling domains are isolated from one another; a
port can only enqueue to and dequeue from queues within its scheduling
domain. A scheduling domain's resources are configured through a configfs
interface. Please refer to Documentation/misc-devices/dlb.rst (in patch 01/17)
for a detailed description on the DLB configfs implementation.

Intel DLB supports SR-IOV and Scalable IOV, and allows for a flexible
division of its resources among the PF and its virtual devices. The virtual
devices are incapable of configuring the device directly; they use a
hardware mailbox to proxy configuration requests to the PF driver. This
driver supports both PF and virtual devices, as there is significant code
re-use between the two, with device-specific behavior handled through a
callback interface.  Virtualization support will be added in a later patch
set.

The DLB driver uses configfs and sysfs as its primary interface While the
DLB sysfs allows users to configure DLB at the device level, the configfs
lets users to create and control DLB scheduling domains, ports and queues.
Configfs supports operations on a scheduling domain's resources
(primarily resource configuration).  Scheduling domains are created
dynamically by user-space software.

[1] https://builders.intel.com/docs/networkbuilders/SKU-343247-001US-queue-management-and-load-balancing-on-intel-architecture.pdf
[2] https://doc.dpdk.org/guides/prog_guide/eventdev.html

This submission is still a work in progress. We have replaced ioctl interface
with configfs and made a few other changes based on earlier reviews (see
https://lore.kernel.org/all/BYAPR11MB30952BA538BB905331392A08D9119@BYAPR11MB3095.namprd11.prod.outlook.com/).
There are a couple of issues that we would like to get help and suggestions
from reviewers and community.

1. Before a scheduling domain is created/enabled, a set of parameters are
passed to the kernel driver via configfs attribute files in an configfs domain
directory (say $domain) created by user. Each attribute file corresponds to
a configuration parameter of the domain. After writing to all the attribute
files, user writes 1 to "create" attribute, which triggers an action (i.e.,
domain creation) in the kernel driver. Since multiple processes/users can
access the $domain directory, multiple users can write to the attribute files
at the same time.  How do we guarantee an atomic update/configuration of a
domain? In other words, if user A wants to set attributes 1 and 2, how can we
prevent user B from changing attribute 1 and 2 before user A writes 1 to
"create"? A configfs directory with individual attribute files seems to not
be able to provide atomic configuration in this case. One option to solve this
issue could be write a structured data (with a set of parameters) to a single
attribute file. This would guarantee the atomic configuration, but may not be
a conventional configfs operation.
 
2. When a user space dlb application exits, it needs to tell the kernel driver
to reset the scheduling domain and return the associated resources back to the
resource pool for a future use. The application does so by write 0 to "create"
attribute file, and it works fine with a normal program exit. However when an
application is killed (for example, kill -9) by user, we need to a way to reset
the domain in the driver. What would be the best approach for the tear down
process in this case? In our current implementation (see patch 06/17) we create
and open an anon file descriptor when a domain is created/enabled. Since the
file is closed automatically when the user application exits or is killed, we
use the file close() operation in the driver to reset and tear down the domain.
Would this approach be acceptable with the configfs implementation? 

Dan points out that configfs puts atomic update and configuration teardown
responsibilities in userspacer. We are looking for a direction check on these
issues before doing deeper reworks.

v12:
- Address Dan's following feedbacks on coding stylei issues.
-- Remove DLB_HW_ERR() and DLB_HW_DBG() macros. Use dev_err() and dev_dbg()
   directly.
-- Replace FIELD_SET() macro with direct coding.
-- Remove all refererences on virt_id/phys_id. They will be introduced in
   future patches.
-- Use list_move() instead of list_del() and list_add() whenever possible.
-- Remove device revision numbers.
-- Reverse the order of iosubmit_cmds512() and wmb().
-- Other cleanups based on Dan's comments.

- The following coding style changes suggested by Dan will be implemented
  in the next revision
-- Replace DLB_CSR_RD() and DLB_CSR_WR() with direct ioread32() and
   iowrite32() call.
-- Remove bitmap wrappers and use linux bitmap functions directly.
-- Use trace_event in configfs attribute file update.

v11:
- Change the user interface from ioctls to configfs. Provide configfs
  interface for create and configure scheduling domains, queues, ports,
  and link/unlink of queues and ports.
- Address all of Greg's feedback on v10, including
-- Consolidate header files. Merged dlb_main.h, dlb_hw_types.h,
   dlb_resources.h and dlb_bitmap.h into dlb_main.h.
-- Removed device ops callbacks. They will be added back to when
   VM support is introduced.
-- Use macros/fucntions provided by linux kernel. Replace BITS_GET()
   and BITS_SET() by the existing linux kernel macros FIELD_GET()
   and FIELD_PREP().
-- Revise the DLB overview document dlb.rst and provide detailed
   descriptions on how DLB works.
- Remove configurations for sequeuence number and class of services.
- Remove all VF and VDEV (specially in dlb_resource.c) related code. Will
  add them back in a later patches.
- Add dlb sysfs for device level control and configurtion.
- Move dynamic port and queueu linking and unlinking to future patch
  set. This reduces the total patches in this set to 17 from 20 in v10.
  Only static linking and unlinking is supported in this submission.

v10:
- Addressed an issue reported by kernel test robot <lkp@intel.com>
-- Add "WITH Linux-syscall-note" to the SPDX-License-Identifier in uapi
   header file dlb.h.

v9:
- Addressed all of Greg's feecback on v8, including
-- Remove function name (__func__) from dev_err() messages, that could spam log.
-- Replace list and function pointer calls in dlb_ioctl() with switch-case
   and real function calls for ioctl.
-- Drop the compat_ptr_ioctl in dlb_ops (struct file_operations).
-- Change ioctl magic number for DLB to unused 0x81 (from 'h').
-- Remove all placeholder/dummy functions in the patch set.
-- Re-arrange the comments in dlb.h so that the order is consistent with that
   of data structures referred.
-- Correct the comments on SPDX License and DLB versions in dlb.h.
-- Replace BIT_SET() and BITS_CLR() marcos with direct coding.   
-- Remove NULL pointer checking (f->private_data) in dlb_ioctl().
-- Use whole line whenever possible and not wrapping lines unnecessarily.
-- Remove __attribute__((unused)).
-- Merge dlb_ioctl.h and dlb_file.h into dlb_main.h

v8:
- Add a functional block diagram in dlb.rst 
- Modify change logs to reflect the links between patches and DPDK
  eventdev library.
- Add a check of power-of-2 for CQ depth.
- Move call to INIT_WORK() to dlb_open().
- Clean dlb workqueue by calling flush_scheduled_work().
- Add unmap_mapping_range() in dlb_port_close().

v7 (Intel internal version):
- Address all of Dan's feedback, including
-- Drop DLB 2.0 throughout the patch set, use DLB only.
-- Fix license and copyright statements
-- Use pcim_enable_device() and pcim_iomap_regions(), instead of
   unmanaged version.
-- Move cdev_add() to dlb_init() and add all devices at once.
-- Fix Makefile, using "+=" style.
-- Remove FLR description and mention movdir64/enqcmd usage in doc.
-- Make the permission for the domain same as that for device for
   ioctl access.
-- Use idr instead of ida.
-- Add a lock in dlb_close() to prevent driver unbinding while ioctl
   coomands are in progress.
-- Remove wrappers that are used for code sharing between kernel driver
   and DPDK. 
- Address Pierre-Louis' feedback, including
-- Clean the warinings from checkpatch
-- Fix the warnings from "make W=1"

v6 (Intel internal version):
- Change the module name to dlb(from dlb2), which currently supports Intel
  DLB 2.0 only.
- Address all of Pierre-Louis' feedback on v5, including
-- Consolidate the two near-identical for loops in dlb2_release_domain_memory().
-- Remove an unnecessary "port = NULL" initialization
-- Consistently use curly braces on the *_LIST_FOR macros
   when the for-loop contents spans multiple lines.
-- Add a comment to the definition of DLB2FS_MAGIC
-- Remove always true if statemnets
-- Move the get_cos_bw mutex unlock call earlier to shorten the critical
   section.
- Address all of Dan's feedbacks, including
-- Replace the unions for register bits access with bitmask and shifts
-- Centralize the "to/from" user memory copies for ioctl functions.
-- Review ioctl design against Documentation/process/botching-up-ioctls.rst
-- Remove wraper functions for memory barriers.
-- Use ilog() to simplify a switch code block.
-- Add base-commit to cover letter.

v5 (Intel internal version):
- Reduce the scope of the initial patch set (drop the last 8 patches)
- Further decompose some of the remaining patches into multiple patches.
- Address all of Pierre-Louis' feedback, including:
-- Move kerneldoc to *.c files
-- Fix SPDX comment style
-- Add BAR macros
-- Improve/clarify struct dlb2_dev and struct device variable naming
-- Add const where missing
-- Clarify existing comments and add new ones in various places
-- Remove unnecessary memsets and zero-initialization
-- Remove PM abstraction, fix missing pm_runtime_allow(), and don't
   update PM refcnt when port files are opened and closed.
-- Convert certain ternary operations into if-statements
-- Out-line the CQ depth valid check
-- De-duplicate the logic in dlb2_release_device_memory()
-- Limit use of devm functions to allocating/freeing struct dlb2
- Address Ira's comments on dlb2.rst and correct commit messages that
  don't use the imperative voice.

v4:
- Move PCI device ID definitions into dlb2_hw_types.h, drop the VF definition
- Remove dlb2_dev_list
- Remove open/close functions and fops structure (unused)
- Remove "(char *)" cast from PCI driver name
- Unwind init failures properly
- Remove ID alloc helper functions and call IDA interfaces directly instead

v3:
- Remove DLB2_PCI_REG_READ/WRITE macros

v2:
- Change driver license to GPLv2 only
- Expand Kconfig help text and remove unnecessary (R)s
- Remove unnecessary prints
- Add a new entry in ioctl-number.rst
- Convert the ioctl handler into a switch statement
- Correct some instances of IOWR that should have been IOR
- Align macro blocks
- Don't break ioctl ABI when introducing new commands
- Remove indirect pointers from ioctl data structures
- Remove the get-sched-domain-fd ioctl command

Mike Ximing Chen (17):
  dlb: add skeleton for DLB driver
  dlb: initialize DLB device
  dlb: add resource and device initialization
  dlb: add configfs interface and scheduling domain directory
  dlb: add scheduling domain configuration
  dlb: add domain software reset
  dlb: add low-level register reset operations
  dlb: add runtime power-management support
  dlb: add queue create, reset, get-depth configfs interface
  dlb: add register operations for queue management
  dlb: add configfs interface to configure ports
  dlb: add register operations for port management
  dlb: add port mmap support
  dlb: add start domain configfs attribute
  dlb: add queue map, unmap, and pending unmap
  dlb: add static queue map register operations
  dlb: add basic sysfs interfaces

 Documentation/ABI/testing/sysfs-driver-dlb |  116 +
 Documentation/misc-devices/dlb.rst         |  323 ++
 Documentation/misc-devices/index.rst       |    1 +
 MAINTAINERS                                |    7 +
 drivers/misc/Kconfig                       |    1 +
 drivers/misc/Makefile                      |    1 +
 drivers/misc/dlb/Kconfig                   |   18 +
 drivers/misc/dlb/Makefile                  |    7 +
 drivers/misc/dlb/dlb_args.h                |  372 ++
 drivers/misc/dlb/dlb_configfs.c            | 1225 ++++++
 drivers/misc/dlb/dlb_configfs.h            |  195 +
 drivers/misc/dlb/dlb_file.c                |  149 +
 drivers/misc/dlb/dlb_main.c                |  616 +++
 drivers/misc/dlb/dlb_main.h                |  653 ++++
 drivers/misc/dlb/dlb_pf_ops.c              |  299 ++
 drivers/misc/dlb/dlb_regs.h                | 3640 ++++++++++++++++++
 drivers/misc/dlb/dlb_resource.c            | 3960 ++++++++++++++++++++
 include/uapi/linux/dlb.h                   |   40 +
 18 files changed, 11623 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-driver-dlb
 create mode 100644 Documentation/misc-devices/dlb.rst
 create mode 100644 drivers/misc/dlb/Kconfig
 create mode 100644 drivers/misc/dlb/Makefile
 create mode 100644 drivers/misc/dlb/dlb_args.h
 create mode 100644 drivers/misc/dlb/dlb_configfs.c
 create mode 100644 drivers/misc/dlb/dlb_configfs.h
 create mode 100644 drivers/misc/dlb/dlb_file.c
 create mode 100644 drivers/misc/dlb/dlb_main.c
 create mode 100644 drivers/misc/dlb/dlb_main.h
 create mode 100644 drivers/misc/dlb/dlb_pf_ops.c
 create mode 100644 drivers/misc/dlb/dlb_regs.h
 create mode 100644 drivers/misc/dlb/dlb_resource.c
 create mode 100644 include/uapi/linux/dlb.h


base-commit: 519d81956ee277b4419c723adfb154603c2565ba

Comments

Greg Kroah-Hartman Dec. 21, 2021, 7:09 a.m. UTC | #1
On Tue, Dec 21, 2021 at 12:50:30AM -0600, Mike Ximing Chen wrote:
> v12:

<snip>

How is a "RFC" series on version 12?  "RFC" means "I do not think this
should be merged, please give me some comments on how this is all
structured" which I think is not the case here.

> - The following coding style changes suggested by Dan will be implemented
>   in the next revision
> -- Replace DLB_CSR_RD() and DLB_CSR_WR() with direct ioread32() and
>    iowrite32() call.
> -- Remove bitmap wrappers and use linux bitmap functions directly.
> -- Use trace_event in configfs attribute file update.

Why submit a patch series that you know will be changed?  Just do the
work, don't ask anyone to review stuff you know is incorrect, that just
wastes our time and ensures that we never want to review it again.

greg k-h
Andrew Lunn Dec. 21, 2021, 9:40 a.m. UTC | #2
> 1. Before a scheduling domain is created/enabled, a set of parameters are
> passed to the kernel driver via configfs attribute files in an configfs domain
> directory (say $domain) created by user. Each attribute file corresponds to
> a configuration parameter of the domain. After writing to all the attribute
> files, user writes 1 to "create" attribute, which triggers an action (i.e.,
> domain creation) in the kernel driver. Since multiple processes/users can
> access the $domain directory, multiple users can write to the attribute files
> at the same time.  How do we guarantee an atomic update/configuration of a
> domain? In other words, if user A wants to set attributes 1 and 2, how can we
> prevent user B from changing attribute 1 and 2 before user A writes 1 to
> "create"? A configfs directory with individual attribute files seems to not
> be able to provide atomic configuration in this case. One option to solve this
> issue could be write a structured data (with a set of parameters) to a single
> attribute file. This would guarantee the atomic configuration, but may not be
> a conventional configfs operation.

How about throw away configfs and use netlink? Messages are atomic,
and you can add an arbitrary number of attributes to a single netlink
message. It will also make your code more network like, since nothing
else in the network stack uses configfs, as far as i know.

    Andrew
Chen, Mike Ximing Dec. 21, 2021, 2:03 p.m. UTC | #3
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, December 21, 2021 2:10 AM
> To: Chen, Mike Ximing <mike.ximing.chen@intel.com>
> Cc: linux-kernel@vger.kernel.org; arnd@arndb.de; Williams, Dan J <dan.j.williams@intel.com>; pierre-
> louis.bossart@linux.intel.com; netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org
> Subject: Re: [RFC PATCH v12 00/17] dlb: introduce DLB device driver
> 
> On Tue, Dec 21, 2021 at 12:50:30AM -0600, Mike Ximing Chen wrote:
> > v12:
> 
> <snip>
> 
> How is a "RFC" series on version 12?  "RFC" means "I do not think this should be merged, please give me
> some comments on how this is all structured" which I think is not the case here.

Hi Greg,

"RFC" here means exactly what you referred to. As you know we have made many changes since your
last review of the patch set (which was v10).  At this point we are not sure if we are on the right track in
terms of some configfs implementation, and would like some comments from the community. I stated
this in the cover letter before the change log: " This submission is still a work in progress.... , a couple of
issues that we would like to get help and suggestions from reviewers and community". I presented two
issues/questions we are facing, and would like to get comments. 

The code on the other hand are tested and validated on our hardware platforms. I kept the version number
in series (using v12, instead v1) so that reviewers can track the old submissions and have a better
understanding of the patch set's history.

> 
> > - The following coding style changes suggested by Dan will be implemented
> >   in the next revision
> > -- Replace DLB_CSR_RD() and DLB_CSR_WR() with direct ioread32() and
> >    iowrite32() call.
> > -- Remove bitmap wrappers and use linux bitmap functions directly.
> > -- Use trace_event in configfs attribute file update.
> 
> Why submit a patch series that you know will be changed?  Just do the work, don't ask anyone to review
> stuff you know is incorrect, that just wastes our time and ensures that we never want to review it again.
>
Since this is a RFC, and is not for merging or a full review, we though it was OK to log the pending coding
style changes. The patch set was submitted and reviewed by the community before, and there was no
complains on using macros like DLB_CSR_RD(), etc, but we think we can replace them for better
readability of the code.

Thanks
Mike
Greg Kroah-Hartman Dec. 21, 2021, 2:31 p.m. UTC | #4
On Tue, Dec 21, 2021 at 02:03:38PM +0000, Chen, Mike Ximing wrote:
> 
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Tuesday, December 21, 2021 2:10 AM
> > To: Chen, Mike Ximing <mike.ximing.chen@intel.com>
> > Cc: linux-kernel@vger.kernel.org; arnd@arndb.de; Williams, Dan J <dan.j.williams@intel.com>; pierre-
> > louis.bossart@linux.intel.com; netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org
> > Subject: Re: [RFC PATCH v12 00/17] dlb: introduce DLB device driver
> > 
> > On Tue, Dec 21, 2021 at 12:50:30AM -0600, Mike Ximing Chen wrote:
> > > v12:
> > 
> > <snip>
> > 
> > How is a "RFC" series on version 12?  "RFC" means "I do not think this should be merged, please give me
> > some comments on how this is all structured" which I think is not the case here.
> 
> Hi Greg,
> 
> "RFC" here means exactly what you referred to. As you know we have made many changes since your
> last review of the patch set (which was v10).  At this point we are not sure if we are on the right track in
> terms of some configfs implementation, and would like some comments from the community. I stated
> this in the cover letter before the change log: " This submission is still a work in progress.... , a couple of
> issues that we would like to get help and suggestions from reviewers and community". I presented two
> issues/questions we are facing, and would like to get comments. 
> 
> The code on the other hand are tested and validated on our hardware platforms. I kept the version number
> in series (using v12, instead v1) so that reviewers can track the old submissions and have a better
> understanding of the patch set's history.

"RFC" means "I have no idea if this is correct, I am throwing it out
there and anyone who also cares about this type of thing, please
comment".

A patch that is on "RFC 12" means, "We all have no clue how to do this,
we give up and hope you all will do it for us."

I almost never comment on RFC patch series, except for portions of the
kernel that I really care about.  For a brand-new subsystem like this,
that I still do not understand who needs it, that is not the case.

I'm going to stop reviewing this patch series until you at least follow
the Intel required rules for sending kernel patches like this out.  To
not do so would be unfair to your coworkers who _DO_ follow the rules.

> > > - The following coding style changes suggested by Dan will be implemented
> > >   in the next revision
> > > -- Replace DLB_CSR_RD() and DLB_CSR_WR() with direct ioread32() and
> > >    iowrite32() call.
> > > -- Remove bitmap wrappers and use linux bitmap functions directly.
> > > -- Use trace_event in configfs attribute file update.
> > 
> > Why submit a patch series that you know will be changed?  Just do the work, don't ask anyone to review
> > stuff you know is incorrect, that just wastes our time and ensures that we never want to review it again.
> >
> Since this is a RFC, and is not for merging or a full review, we though it was OK to log the pending coding
> style changes. The patch set was submitted and reviewed by the community before, and there was no
> complains on using macros like DLB_CSR_RD(), etc, but we think we can replace them for better
> readability of the code.

Coding style changes should NEVER be ignored and put off for later.
To do so means you do not care about the brains of anyone who you are
wanting to read this code.  We have a coding style because of brains and
pattern matching, not because we are being mean.

good luck,

greg k-h
Dan Williams Dec. 21, 2021, 6:44 p.m. UTC | #5
[ add Christoph for configfs feedback ]

On Tue, Dec 21, 2021 at 7:03 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Dec 21, 2021 at 02:03:38PM +0000, Chen, Mike Ximing wrote:
> >
> > > -----Original Message-----
> > > From: Greg KH <gregkh@linuxfoundation.org>
> > > Sent: Tuesday, December 21, 2021 2:10 AM
> > > To: Chen, Mike Ximing <mike.ximing.chen@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; arnd@arndb.de; Williams, Dan J <dan.j.williams@intel.com>; pierre-
> > > louis.bossart@linux.intel.com; netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org
> > > Subject: Re: [RFC PATCH v12 00/17] dlb: introduce DLB device driver
> > >
> > > On Tue, Dec 21, 2021 at 12:50:30AM -0600, Mike Ximing Chen wrote:
> > > > v12:
> > >
> > > <snip>
> > >
> > > How is a "RFC" series on version 12?  "RFC" means "I do not think this should be merged, please give me
> > > some comments on how this is all structured" which I think is not the case here.
> >
> > Hi Greg,
> >
> > "RFC" here means exactly what you referred to. As you know we have made many changes since your
> > last review of the patch set (which was v10).  At this point we are not sure if we are on the right track in
> > terms of some configfs implementation, and would like some comments from the community. I stated
> > this in the cover letter before the change log: " This submission is still a work in progress.... , a couple of
> > issues that we would like to get help and suggestions from reviewers and community". I presented two
> > issues/questions we are facing, and would like to get comments.
> >
> > The code on the other hand are tested and validated on our hardware platforms. I kept the version number
> > in series (using v12, instead v1) so that reviewers can track the old submissions and have a better
> > understanding of the patch set's history.
>
> "RFC" means "I have no idea if this is correct, I am throwing it out
> there and anyone who also cares about this type of thing, please
> comment".
>
> A patch that is on "RFC 12" means, "We all have no clue how to do this,
> we give up and hope you all will do it for us."
>
> I almost never comment on RFC patch series, except for portions of the
> kernel that I really care about.  For a brand-new subsystem like this,
> that I still do not understand who needs it, that is not the case.
>
> I'm going to stop reviewing this patch series until you at least follow
> the Intel required rules for sending kernel patches like this out.  To
> not do so would be unfair to your coworkers who _DO_ follow the rules.
>
> > > > - The following coding style changes suggested by Dan will be implemented
> > > >   in the next revision
> > > > -- Replace DLB_CSR_RD() and DLB_CSR_WR() with direct ioread32() and
> > > >    iowrite32() call.
> > > > -- Remove bitmap wrappers and use linux bitmap functions directly.
> > > > -- Use trace_event in configfs attribute file update.
> > >
> > > Why submit a patch series that you know will be changed?  Just do the work, don't ask anyone to review
> > > stuff you know is incorrect, that just wastes our time and ensures that we never want to review it again.
> > >
> > Since this is a RFC, and is not for merging or a full review, we though it was OK to log the pending coding
> > style changes. The patch set was submitted and reviewed by the community before, and there was no
> > complains on using macros like DLB_CSR_RD(), etc, but we think we can replace them for better
> > readability of the code.
>
> Coding style changes should NEVER be ignored and put off for later.
> To do so means you do not care about the brains of anyone who you are
> wanting to read this code.  We have a coding style because of brains and
> pattern matching, not because we are being mean.

Hey Greg,

This is my fault.

To date Mike has been patiently and diligently following my review
feedback to continue to make the driver smaller and more Linux
idiomatic. Primarily this has been ripping and replacing a pile of
object configuration ioctls with configfs. While my confidence in that
review feedback was high, my confidence in the current round of deeper
architecture reworks is lower and they seemed to raise questions that
are likely FAQs with using configfs. Specifically the observation that
configfs, like sysfs, lacks an "atomically update multiple attributes"
capability. To my knowledge that's just the expected tradeoff with
pseudo-fs based configuration and it is up to userspace to coordinate
multiple configuration writers.

The other question is the use of anon_inode_getfd(). To me that
mechanism is reserved for syscall and ioctl based architectures, and
in this case it was only being used as a mechanism to get an automatic
teardown action at process exit. Again, my inclination is that configs
requires userspace to clean up anything it created. If "tear down on
last close" behavior is needed that would either need to come from a
userspace daemon to watch clients, or another character device that
clients could open to represent the active users of the configuration.
My preference is for the former.

I green-lighted the work-in-progress / RFC posting  (with the known
style warts) to get momentum on just those questions. I thought it
better to not polish this driver to a shine and get some mid-rework
feedback. Mike continues to be a pleasure to work with, please take
any frustrations on how this was presented out on me, I'll do better
next time for these types of questions. DLB is unlike anything I have
reviewed previously.
Andrew Lunn Dec. 21, 2021, 7:57 p.m. UTC | #6
> Hey Greg,
> 
> This is my fault.
> 
> To date Mike has been patiently and diligently following my review
> feedback to continue to make the driver smaller and more Linux
> idiomatic. Primarily this has been ripping and replacing a pile of
> object configuration ioctls with configfs. While my confidence in that
> review feedback was high, my confidence in the current round of deeper
> architecture reworks is lower and they seemed to raise questions that
> are likely FAQs with using configfs. Specifically the observation that
> configfs, like sysfs, lacks an "atomically update multiple attributes"
> capability. To my knowledge that's just the expected tradeoff with
> pseudo-fs based configuration and it is up to userspace to coordinate
> multiple configuration writers.

Hi Dan

If this is considered a network accelerator, it probably should use
the same configuration mechanisms all of networking uses, netlink. I'm
not aware of anything network related using configfs, but it could
exist. netlink messages should also solve your atomisity problem.

But it does not really help with cleanup when the userspace user goes
away. Is there anything from GPU drivers which can be reused? They
must have some sort of cleanup when the user space DRM driver exits.

     Andrew
Chen, Mike Ximing Dec. 22, 2021, 4:37 a.m. UTC | #7
> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Tuesday, December 21, 2021 4:41 AM
> To: Chen, Mike Ximing <mike.ximing.chen@intel.com>
> Cc: linux-kernel@vger.kernel.org; arnd@arndb.de; gregkh@linuxfoundation.org; Williams, Dan J
> <dan.j.williams@intel.com>; pierre-louis.bossart@linux.intel.com; netdev@vger.kernel.org;
> davem@davemloft.net; kuba@kernel.org
> Subject: Re: [RFC PATCH v12 00/17] dlb: introduce DLB device driver
> 
> > 1. Before a scheduling domain is created/enabled, a set of parameters
> > are passed to the kernel driver via configfs attribute files in an
> > configfs domain directory (say $domain) created by user. Each
> > attribute file corresponds to a configuration parameter of the domain.
> > After writing to all the attribute files, user writes 1 to "create"
> > attribute, which triggers an action (i.e., domain creation) in the
> > kernel driver. Since multiple processes/users can access the $domain
> > directory, multiple users can write to the attribute files at the same
> > time.  How do we guarantee an atomic update/configuration of a domain?
> > In other words, if user A wants to set attributes 1 and 2, how can we
> > prevent user B from changing attribute 1 and 2 before user A writes 1
> > to "create"? A configfs directory with individual attribute files
> > seems to not be able to provide atomic configuration in this case. One
> > option to solve this issue could be write a structured data (with a
> > set of parameters) to a single attribute file. This would guarantee the atomic configuration, but may not
> be a conventional configfs operation.
> 
> How about throw away configfs and use netlink? Messages are atomic, and you can add an arbitrary
> number of attributes to a single netlink message. It will also make your code more network like, since
> nothing else in the network stack uses configfs, as far as i know.
> 
Hi Andrew,
As I explained in my other response, DLB is not a network accelerator and DLB
driver is not a part of network stack. We would obviously prefer to resolve the
atomic update and resource reset at tear-down Issues within the configfs
framework if possible. But I will take a look at the netlink implementations.

Thanks for the suggestion
Mike
Christoph Hellwig Dec. 22, 2021, 8:01 a.m. UTC | #8
On Tue, Dec 21, 2021 at 10:44:11AM -0800, Dan Williams wrote:
> are likely FAQs with using configfs. Specifically the observation that
> configfs, like sysfs, lacks an "atomically update multiple attributes"
> capability. To my knowledge that's just the expected tradeoff with
> pseudo-fs based configuration and it is up to userspace to coordinate
> multiple configuration writers.

Yes.  For the SCSI and nvme targets we do a required attributes must
be set before something can be enabled, but that might not work
everywhere.

> The other question is the use of anon_inode_getfd(). To me that
> mechanism is reserved for syscall and ioctl based architectures, and

It is.

> in this case it was only being used as a mechanism to get an automatic
> teardown action at process exit. Again, my inclination is that configs
> requires userspace to clean up anything it created. If "tear down on
> last close" behavior is needed that would either need to come from a
> userspace daemon to watch clients, or another character device that
> clients could open to represent the active users of the configuration.
> My preference is for the former.

This really sounds like configfs is the wrong interface.  But I'd have
to find time to see what dlb actually is before commenting on what might
be a better interface.