Message ID | 20240930201358.2638665-1-aahringo@redhat.com (mailing list archive) |
---|---|
Series | dlm: net-namespace functionality |
>>>>> "Alexander" == Alexander Aring <aahringo@redhat.com> writes:

> Hi,
> this patch-series is huge but brings a lot of basic "fun" net-namespace
> functionality to DLM. Currently you need a couple of Linux kernel

Please spell out TLAs like DLM the first time you use them. In this
case I'm sure you mean Distributed Lock Manager, right?

> instances running in e.g. Virtual Machines. With this patch-series I
> want to break out of this virtual machine world dealing with multiple
> kernels need to boot them all individually, etc. Now you can use DLM in
> only one Linux kernel instance and each "node" (previously represented
> by a virtual machine) is separate by a net-namespace. Why
> net-namespaces? It just fits to the DLM design for now, you need to have
> them anyway because the internal DLM socket handling on a per node
> basis. What we do additionally is to separate the DLM lockspaces (the
> lockspace that is being registered) by net-namespaces as this represents
> a "network entity" (node). There might be reasons to introduce a
> complete new kind of namespaces (locking namespace?) but I don't want to
> do this step now and as I said net-namespaces are required anyway for
> the DLM sockets.

This section needs to be re-written to more clearly explain what
you're trying to accomplish here, and how this is different or better
than what went before. I realize you probably have this knowledge all
internalized, but spelling it out in a clear and simple manner would
be helpful to everyone.

> You need some new user space tooling as a new netlink net-namespace
> aware UAPI is introduced (but can co-exist with configfs that operates
> on init_net only). See [0] for more steps, there is a copr repo for the
> new tooling and can be enabled by:

What the heck is a 'copr'?

> $ dnf copr enable aring/nldlm
> $ dnf install nldlm
>
> or compile it yourself.

These steps really entirely ignore the _why_ you would do this. And
assume Red Hat based systems.

> Then there is currently a very simple script [1] to show a 3 nodes cluster

nit: 3 node cluster

> using gfs2 on a multiple loop block devices on a shared loop block device
> image (sounds weird but I do something like that). There are currently
> some user space synchronization issues that I solve by simple sleeps,
> but they are only user space problems.

Can you give the example on how to do this setup? Ideally in another
patch which updates the Documentation/??? file in the kernel tree.

> To test it I recommend some virtual machine "but only one" and run the

I'm having a hard time parsing this, please be more careful with
singular or plural usage. English is hard! :-)

> [1] script. Afterwards you have in your executed net-namespace the 3
> mountpoints /cluster/node1, /cluster/node2/ and /cluster/node3. Any vfs
> operations on those mountpoints acts as a per node entity operation.

Which means what? So if I write to /cluster/node1/foo, it shows up in
the other two mount points? Or do I need to create a filesystem on
top?

> We can use it for testing, development and also scale testing to have a
> large number of nodes joining a lockspace (which seems to be a problem
> right now). Instead of running 1000 vms, we can run 1000 net-namespaces
> in a more resource limited environment. For me it seems gfs2 can handle
> several mounts and still separate the resource according their global
> variables. Their data structures e.g. glock hash seems to have in their
> key a separation for that (fsid?). However this is still an experimental
> feature we might run into issues that requires more separation related
> to net-namespaces. However basic testing seems to run just fine.

So is this all just to make testing and development easier so you
don't need 10 or 1000 nodes to do stress testing? Would anyone use
this in real life?

> Limitations
>
> I disable any functionality for the DLM character device that allow
> plock handling or do DLM locking from user space. Just don't use any
> plock locking in gfs2 for now. But basic vfs operations should work. You
> can even sniff DLM traffic on the created "dlmsw" virtual bridge.

So... what functionality is exposed by this patchset? And maybe add
in an "Advantages" section to explain why this is so good.

Thanks!
John

> - Alex
>
> [0] https://gitlab.com/netcoder/nldlm
> [1] https://gitlab.com/netcoder/gfs2ns-examples/-/blob/main/three_nodes
>
> changes since v2:
>  - move to ynl and introduce and use netlink yaml spec
>  - put the nldlm.h DLM netlink header under UAPI directory
>  - fix build issues building with CONFIG_NET disabled
>  - fix possible nullpointer deference if lookup of lockspace failed
>
> Alexander Aring (12):
>   dlm: introduce dlm_find_lockspace_name()
>   dlm: disallow different configs nodeid storages
>   dlm: add struct net to dlm_new_lockspace()
>   dlm: handle port as __be16 network byte order
>   dlm: use dlm_config as only cluster configuration
>   dlm: dlm_config_info config fields to unsigned int
>   dlm: rename config to configfs
>   kobject: add kset_type_create_and_add() helper
>   kobject: export generic helper ops
>   dlm: separate dlm lockspaces per net-namespace
>   dlm: add nldlm net-namespace aware UAPI
>   gfs2: separate mount context by net-namespaces
>
>  Documentation/netlink/specs/nldlm.yaml |  438 ++++++++
>  drivers/md/md-cluster.c                |    3 +-
>  fs/dlm/Makefile                        |    3 +
>  fs/dlm/config.c                        | 1291 +++++++++--------------
>  fs/dlm/config.h                        |  215 +++-
>  fs/dlm/configfs.c                      |  882 ++++++++++++++++
>  fs/dlm/configfs.h                      |   19 +
>  fs/dlm/debug_fs.c                      |   24 +-
>  fs/dlm/dir.c                           |    4 +-
>  fs/dlm/dlm_internal.h                  |   24 +-
>  fs/dlm/lock.c                          |   64 +-
>  fs/dlm/lock.h                          |    3 +-
>  fs/dlm/lockspace.c                     |  220 ++--
>  fs/dlm/lockspace.h                     |   12 +-
>  fs/dlm/lowcomms.c                      |  525 +++++-----
>  fs/dlm/lowcomms.h                      |   29 +-
>  fs/dlm/main.c                          |    5 -
>  fs/dlm/member.c                        |   36 +-
>  fs/dlm/midcomms.c                      |  287 ++---
>  fs/dlm/midcomms.h                      |   31 +-
>  fs/dlm/netlink2.c                      | 1330 ++++++++++++++++++++++++
>  fs/dlm/nldlm-kernel.c                  |  290 ++++++
>  fs/dlm/nldlm-kernel.h                  |   50 +
>  fs/dlm/nldlm.c                         |  847 +++++++++++++++
>  fs/dlm/plock.c                         |    2 +-
>  fs/dlm/rcom.c                          |   16 +-
>  fs/dlm/rcom.h                          |    3 +-
>  fs/dlm/recover.c                       |   17 +-
>  fs/dlm/user.c                          |   63 +-
>  fs/dlm/user.h                          |    2 +-
>  fs/gfs2/glock.c                        |    8 +
>  fs/gfs2/incore.h                       |    2 +
>  fs/gfs2/lock_dlm.c                     |    6 +-
>  fs/gfs2/ops_fstype.c                   |    5 +
>  fs/gfs2/sys.c                          |   35 +-
>  fs/ocfs2/stack_user.c                  |    2 +-
>  include/linux/dlm.h                    |    9 +-
>  include/linux/kobject.h                |   10 +-
>  include/uapi/linux/nldlm.h             |  153 +++
>  lib/kobject.c                          |   65 +-
>  40 files changed, 5566 insertions(+), 1464 deletions(-)
>  create mode 100644 Documentation/netlink/specs/nldlm.yaml
>  create mode 100644 fs/dlm/configfs.c
>  create mode 100644 fs/dlm/configfs.h
>  create mode 100644 fs/dlm/netlink2.c
>  create mode 100644 fs/dlm/nldlm-kernel.c
>  create mode 100644 fs/dlm/nldlm-kernel.h
>  create mode 100644 fs/dlm/nldlm.c
>  create mode 100644 include/uapi/linux/nldlm.h
>
> --
> 2.43.0
Hi,

On Mon, Sep 30, 2024 at 4:49 PM John Stoffel <john@stoffel.org> wrote:
>
> >>>>> "Alexander" == Alexander Aring <aahringo@redhat.com> writes:
>
> > Hi,
> > this patch-series is huge but brings a lot of basic "fun" net-namespace
> > functionality to DLM. Currently you need a couple of Linux kernel
>
> Please spell out TLAs like DLM the first time you use them. In this
> case I'm sure you mean Distributed Lock Manager, right?
>

Yes, DLM stands for Distributed Lock Manager; it currently lives in
"fs/dlm".

> > instances running in e.g. Virtual Machines. With this patch-series I
> > want to break out of this virtual machine world dealing with multiple
> > kernels need to boot them all individually, etc. Now you can use DLM in
> > only one Linux kernel instance and each "node" (previously represented
> > by a virtual machine) is separate by a net-namespace. Why
> > net-namespaces? It just fits to the DLM design for now, you need to have
> > them anyway because the internal DLM socket handling on a per node
> > basis. What we do additionally is to separate the DLM lockspaces (the
> > lockspace that is being registered) by net-namespaces as this represents
> > a "network entity" (node). There might be reasons to introduce a
> > complete new kind of namespaces (locking namespace?) but I don't want to
> > do this step now and as I said net-namespaces are required anyway for
> > the DLM sockets.
>
> This section needs to be re-written to more clearly explain what
> you're trying to accomplish here, and how this is different or better
> than what went before. I realize you probably have this knowledge all
> internalized, but spelling it out in a clear and simple manner would
> be helpful to everyone.
>

Okay, I'll try my best next time. Usually lockspaces are separated per
node, where each node is a different "network entity". With
net-namespaces I separate them inside one kernel, instead of building
each "network entity" as a virtual machine that runs a different Linux
kernel instance. One open question is whether DLM lockspaces should be
separated by net-namespace or whether yet another "locking" namespace
should be introduced. I don't want to take that step yet, as lockspaces
are separated by "network entity" anyway.

> > You need some new user space tooling as a new netlink net-namespace
> > aware UAPI is introduced (but can co-exist with configfs that operates
> > on init_net only). See [0] for more steps, there is a copr repo for the
> > new tooling and can be enabled by:
>
> What the heck is a 'copr'?
>

That is just a binary repo for rpm packages. Some users may find it
handy.

>
> > $ dnf copr enable aring/nldlm
> > $ dnf install nldlm
>
> > or compile it yourself.
>
> These steps really entirely ignore the _why_ you would do this. And
> assume Red Hat based systems.
>

That is correct. I will mention that those steps are only for those
specific systems.

> > Then there is currently a very simple script [1] to show a 3 nodes cluster
>
> nit: 3 node cluster
>
> > using gfs2 on a multiple loop block devices on a shared loop block device
> > image (sounds weird but I do something like that). There are currently
> > some user space synchronization issues that I solve by simple sleeps,
> > but they are only user space problems.
>
> Can you give the example on how to do this setup? Ideally in another
> patch which updates the Documentation/??? file in the kernel tree.
>

https://gitlab.com/netcoder/gfs2ns-examples/-/blob/main/three_nodes

This is what I referenced as [1]. Okay, I will move the examples out of
my separate repository and add them under Documentation/.

> > To test it I recommend some virtual machine "but only one" and run the
>
> I'm having a hard time parsing this, please be more careful with
> singular or plural usage. English is hard! :-)
>
> > [1] script. Afterwards you have in your executed net-namespace the 3
> > mountpoints /cluster/node1, /cluster/node2/ and /cluster/node3. Any vfs
> > operations on those mountpoints acts as a per node entity operation.
>
> Which means what? So if I write to /cluster/node1/foo, it shows up in
> the other two mount points? Or do I need to create a filesystem on
> top?
>

Now we are at a point where I think nobody has done it this way before.
I create a "fake" shared block device from 3 block devices, /dev/loop1,
/dev/loop2 and /dev/loop3, which all point to the same filesystem image.
Then I create the gfs2 filesystem on it only once. Afterwards you can
call mount for each block device from a process context inside the
"network entity" assigned to it. The example script does a mount from
each net-namespace into the executed net-namespace, so you can access
each per "network entity" mountpoint via /cluster/node1, /cluster/node2
and /cluster/node3 from the executed net-namespace context.

Yes, when you call touch /cluster/node1/foo it should show up in the
other mountpoints.

> > We can use it for testing, development and also scale testing to have a
> > large number of nodes joining a lockspace (which seems to be a problem
> > right now). Instead of running 1000 vms, we can run 1000 net-namespaces
> > in a more resource limited environment. For me it seems gfs2 can handle
> > several mounts and still separate the resource according their global
> > variables. Their data structures e.g. glock hash seems to have in their
> > key a separation for that (fsid?). However this is still an experimental
> > feature we might run into issues that requires more separation related
> > to net-namespaces. However basic testing seems to run just fine.
>
> So is this all just to make testing and development easier so you
> don't need 10 or 1000 nodes to do stress testing? Would anyone use
> this in real life?
>

Stress testing maybe, development easier for sure. There are scaling
issues with recovery handling at around 100 nodes, related to the fact
that DLM stops all lockspace activity when nodes join or leave. That is
something I want to look at once this patch series is hopefully
upstream.

Another example is the DLM lock verifier [0], and I need to be careful
with the name "lock verifier". It verifies that only compatible lock
modes can be in use at the same time on a per "network entity" basis.
This is the fundamental mechanism of DLM; if this does not work, DLM is
broken. We can do that now because we know the whole cluster
information. We can confirm on any payload that DLM works correctly.
For me, this alone is worth having this feature.

For example, we can introduce a new sort of cluster file system
xfstests: touch /cluster/node1/foo and check if the file shows up in
/cluster/node2/foo. That is an easy example; sometimes we need to
synchronize vfs operations and check them on the other "network
entity". With this feature we no longer need to synchronize our
"testing script" over the network with other processes running on
other "network entities".

In real life there is maybe not an example yet. Maybe when people start
to use DLM for user space locking on a container basis, but this
requires net-namespace user space locking functionality, which is a
future step.

> > Limitations
>
> > I disable any functionality for the DLM character device that allow
> > plock handling or do DLM locking from user space. Just don't use any
> > plock locking in gfs2 for now. But basic vfs operations should work. You
> > can even sniff DLM traffic on the created "dlmsw" virtual bridge.
>
> So... what functionality is exposed by this patchset? And maybe add
> in an "Advantages" section to explain why this is so good.
>

Sure, it is important to mention that this net-namespace functionality
is experimental. If you use DLM without changing the net-namespace
process context it should work as before; in this case there are no
limitations.

- Alex

[0] https://lore.kernel.org/gfs2/20240827180236.316946-1-aahringo@redhat.com/T/#t
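
To make the lock verifier idea from the reply above more concrete (checking
that only compatible lock modes are granted at the same time across nodes),
here is a minimal sketch in Python. It is not the verifier from the linked
series. It only assumes the classic DLM mode compatibility table (NL, CR, CW,
PR, PW, EX), and the names `COMPATIBLE`, `modes_compatible` and
`find_conflicts` are made up for illustration. It also simplifies things by
assuming each node holds at most one granted lock on the resource.

```python
from itertools import combinations

# Classic DLM lock modes: null, concurrent read, concurrent write,
# protected read, protected write, exclusive.
MODES = ("NL", "CR", "CW", "PR", "PW", "EX")

# For each mode, the set of modes that may be granted alongside it
# on the same resource (illustrative table, see the lead-in).
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def modes_compatible(a, b):
    """Return True if modes a and b may be granted at the same time."""
    return b in COMPATIBLE[a]


def find_conflicts(granted):
    """Return the pairs of nodes whose granted modes are incompatible.

    `granted` maps a node name (hypothetical, e.g. "node1") to the mode
    that node currently holds on one resource.
    """
    conflicts = []
    for (n1, m1), (n2, m2) in combinations(sorted(granted.items()), 2):
        if not modes_compatible(m1, m2):
            conflicts.append((n1, n2))
    return conflicts


if __name__ == "__main__":
    # Two readers plus a concurrent reader: no conflict expected.
    print(find_conflicts({"node1": "PR", "node2": "PR", "node3": "CR"}))
    # An exclusive holder next to a reader: reported as a conflict.
    print(find_conflicts({"node1": "EX", "node2": "PR"}))
```

With all nodes living in one kernel, a check of this shape can see every
node's granted modes at once, which is the property the reply points to when
it says the whole cluster information is known.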