Message ID | 20170713182018.GE11069@obsidianresearch.com (mailing list archive) |
---|---|
State | Superseded |
On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> Here is an untested attempt to move module loading into srp_daemon -
> what do you think Bart?

Hello Jason,

Thanks for the patch. To me this looks like a nice simplification compared
to the current approach. Do you want me to test this patch?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Thu, Jul 13, 2017 at 06:33:07PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> > Here is an untested attempt to move module loading into srp_daemon -
> > what do you think Bart?
>
> Hello Jason,
>
> Thanks for the patch. To me this looks like a nice simplification compared
> to the current approach. Do you want me to test this patch?

I did some more steps..

Benjamin/Bart let me know what you think, I'll send the patches to the
list if you like the idea.

https://github.com/jgunthorpe/rdma-plumbing/tree/systemd

This replaces Benjamin's rdma.conf based approach.

The basic approach is to directly use systemd to load the modules and
determine what modules to load by using udev and explicit Requires in
the units.

Eg, the extra modules for mlx4 are loaded by triggering this udev rule:

    SUBSYSTEM=="module", KERNEL=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"

Which causes the module loader to load the contents of the modules
list in /etc/rdma/modules/mlx4.conf

If the user does not want a specific module to load then they can
comment out the module line in the /etc/rdma/modules files or use the
usual module blacklist scheme.

The interesting thing about this is how the boot ordering works. Since
rdma-load-modules@.service is 'before' sysinit.target, it will run
before normal system startup tasks if the unit is scheduled to run
early enough. I expect that boot time module loads triggered by udev
will be early enough..

This makes it much closer to normal module loading, instead of being
so strange, and should eliminate much of the need for explicit
rdma.service dependencies.

The places that need something special, like srp, can directly depend
on their specific module loader:

    Requires=rdma-load-modules@srp_daemon

Again, if srp_daemon is scheduled to start during system boot this
should result in the modules being loaded together with the rest of
the modules before sysinit.target.

The one nit that I don't have an easy solution for is starting modules
depending on the device technology, eg not starting srp or ipoib on
!IB devices. The issue is that I can't think of an easy way to detect
the device technology from the udev rule, at least not without a
helper script or additional kernel sysfs..

I haven't been able to test this yet, help would be appreciated.

Jason

On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> The one nit that I don't have an easy solution for is starting modules
> depending on the device technology, eg not starting srp or ipoib on
> !IB devices. The issue is that I can't think of an easy way to detect
> the device technology from the udev rule, at least not without a
> helper script or additional kernel sysfs..

Are users expected to modify the kernel-boot/modules/*.conf files? I
expect that only a minority of the RDMA users will want to load the ib_srpt
and ib_isert kernel modules.

> I haven't been able to test this yet, help would be appreciated.

I will try to free up some time to test the SRP-related changes.

Bart.

On Thu, Jul 13, 2017 at 10:36:49PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> > The one nit that I don't have an easy solution for is starting modules
> > depending on the device technology, eg not starting srp or ipoib on
> > !IB devices. The issue is that I can't think of an easy way to detect
> > the device technology from the udev rule, at least not without a
> > helper script or additional kernel sysfs..
>
> Are users expected to modify the kernel-boot/modules/*.conf files? I
> expect that only a minority of the RDMA users will want to load the ib_srpt
> and ib_isert kernel modules.

I'd say in the same way they would be expected to modify the RedHat
/etc/rdma/rdma.conf file. I copied the defaults from that file, which
was to autoload target drivers.

I agree with you, they probably don't make sense to be defaulted on.
Ideally their respective supporting packages or kernel would ensure
the modules autoload if someone is working with RDMA... I know nothing
about the target infrastructure - can you see a way to do that?

Thanks,
Jason

On Thu, 2017-07-13 at 16:41 -0600, Jason Gunthorpe wrote:
> On Thu, Jul 13, 2017 at 10:36:49PM +0000, Bart Van Assche wrote:
> > On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> > > The one nit that I don't have an easy solution for is starting modules
> > > depending on the device technology, eg not starting srp or ipoib on
> > > !IB devices. The issue is that I can't think of an easy way to detect
> > > the device technology from the udev rule, at least not without a
> > > helper script or additional kernel sysfs..
> >
> > Are users expected to modify the kernel-boot/modules/*.conf files? I
> > expect that only a minority of the RDMA users will want to load the ib_srpt
> > and ib_isert kernel modules.
>
> I'd say in the same way they would be expected to modify the RedHat
> /etc/rdma/rdma.conf file. I copied the defaults from that file, which
> was to autoload target drivers.
>
> I agree with you, they probably don't make sense to be defaulted on.
> Ideally their respective supporting packages or kernel would ensure
> the modules autoload if someone is working with RDMA... I know nothing
> about the target infrastructure - can you see a way to do that?

Hello Jason,

Users who want to enable RDMA target functionality probably will know the
names of the kernel modules that provide such functionality. On Debian
systems the names of these kernel modules can e.g. be added to /etc/modules.
However, I'm not sure all distros have an equivalent of the /etc/modules
file.

Bart.

On Thu, Jul 13, 2017 at 10:45:14PM +0000, Bart Van Assche wrote:
> Users who want to enable RDMA target functionality probably will know the
> names of the kernel modules that provide such functionality. On Debian
> systems the names of these kernel modules can e.g. be added to /etc/modules.
> However, I'm not sure all distros have an equivalent of the /etc/modules
> file.

systemd supports /etc/modules-load.d/ as a built-in; some distros will
symlink ../modules.conf to that directory, but these days dropping a
file in /etc/modules-load.d/ is pretty safe.

Jason

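As an illustration of the /etc/modules-load.d/ approach mentioned above, here is a minimal sketch of a drop-in file that would load the two target modules Bart names. The file name `rdma-target.conf` is an assumption made for this example and does not come from the thread; only the directory and the module names do.

```
# /etc/modules-load.d/rdma-target.conf  (hypothetical file name)
# One kernel module name per line; lines starting with '#' are comments.
# systemd-modules-load.service reads this directory early at boot.
ib_srpt
ib_isert
```

This keeps the choice with the admin who actually wants target functionality, instead of defaulting the modules on for every RDMA user.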
Hi Jason,

On Thursday, 13.07.2017, at 16:15 -0600, Jason Gunthorpe wrote:
> On Thu, Jul 13, 2017 at 06:33:07PM +0000, Bart Van Assche wrote:
> > On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> > > Here is an untested attempt to move module loading into
> > > srp_daemon - what do you think Bart?
> >
> > Hello Jason,
> >
> > Thanks for the patch. To me this looks like a nice simplification
> > compared to the current approach. Do you want me to test this patch?
>
> I did some more steps..
>
> Benjamin/Bart let me know what you think, I'll send the patches to
> the list if you like the idea.
>
> https://github.com/jgunthorpe/rdma-plumbing/tree/systemd
>
> This replaces Benjamin's rdma.conf based approach.
>
> The basic approach is to directly use systemd to load the modules and
> determine what modules to load by using udev and explicit Requires in
> the units.

I like your approach very much. It's cleaner and simpler. So let's
ditch my approach and use yours.

I tested your branch on a Debian 8 (jessie) system with a mlx4 card and
found several points to fix/improve/discuss:

1. The rdma-load-modules@ service needs to set DefaultDependencies=no.
Otherwise it will end up in a dependency loop:

    Found ordering cycle on rdma-load-modules@mlx4.service/start
    Found dependency on basic.target/start
    Found dependency on paths.target/start
    Found dependency on acpid.path/start
    Found dependency on sysinit.target/start
    Found dependency on rdma-load-modules@mlx4.service/start
    Breaking ordering cycle by deleting job paths.target/start
    Found ordering cycle on rdma-load-modules@mlx4.service/start
    Found dependency on paths.target/start
    Found dependency on acpid.path/start
    Found dependency on sysinit.target/start
    Found dependency on rdma-load-modules@mlx4.service/start
    Starting Load RDMA modules from /etc/rdma/modules/mlx4.conf...

2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
instead of listing each .conf file individually.

3. Default conf files for opa.conf and roce.conf are missing.

4. Should rdma-load-modules@ *not* fail if the corresponding .conf is
missing?

5. How to handle built-in modules correctly? Our kernel has the i40e
module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
be started, but the system does not have an i40e card and thus I don't
want to have the module started.

On Friday, 14.07.2017, at 12:08 +0200, Benjamin Drung wrote:
> Hi Jason,
>
> On Thursday, 13.07.2017, at 16:15 -0600, Jason Gunthorpe wrote:
> > On Thu, Jul 13, 2017 at 06:33:07PM +0000, Bart Van Assche wrote:
> > > On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> > > > Here is an untested attempt to move module loading into
> > > > srp_daemon - what do you think Bart?
> > >
> > > Hello Jason,
> > >
> > > Thanks for the patch. To me this looks like a nice simplification
> > > compared to the current approach. Do you want me to test this patch?
> >
> > I did some more steps..
> >
> > Benjamin/Bart let me know what you think, I'll send the patches to
> > the list if you like the idea.
> >
> > https://github.com/jgunthorpe/rdma-plumbing/tree/systemd
> >
> > This replaces Benjamin's rdma.conf based approach.
> >
> > The basic approach is to directly use systemd to load the modules
> > and determine what modules to load by using udev and explicit
> > Requires in the units.
>
> I like your approach very much. It's cleaner and simpler. So let's
> ditch my approach and use yours.
>
> I tested your branch on a Debian 8 (jessie) system with a mlx4 card
> and found several points to fix/improve/discuss:
>
> 1. The rdma-load-modules@ service needs to set DefaultDependencies=no.
> Otherwise it will end up in a dependency loop:
>
>     Found ordering cycle on rdma-load-modules@mlx4.service/start
>     Found dependency on basic.target/start
>     Found dependency on paths.target/start
>     Found dependency on acpid.path/start
>     Found dependency on sysinit.target/start
>     Found dependency on rdma-load-modules@mlx4.service/start
>     Breaking ordering cycle by deleting job paths.target/start
>     Found ordering cycle on rdma-load-modules@mlx4.service/start
>     Found dependency on paths.target/start
>     Found dependency on acpid.path/start
>     Found dependency on sysinit.target/start
>     Found dependency on rdma-load-modules@mlx4.service/start
>     Starting Load RDMA modules from /etc/rdma/modules/mlx4.conf...
>
> 2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
> instead of listing each .conf file individually.
>
> 3. Default conf files for opa.conf and roce.conf are missing.
>
> 4. Should rdma-load-modules@ *not* fail if the corresponding .conf is
> missing?
>
> 5. How to handle built-in modules correctly? Our kernel has the i40e
> module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
> be started, but the system does not have an i40e card and thus I don't
> want to have the module started.

Forgot one point:

6. The ipoib module (loaded by rdma-load-modules@infiniband) needs to
be loaded before the networking.service is running. The
networking.service brings up the network devices on Debian. It runs
"ifup -a" which reads /etc/network/interfaces which we use to configure
our ipoib devices. Here is the networking.service definition:

    bdrung@server:~$ systemctl cat networking.service
    # /run/systemd/generator.late/networking.service
    # Automatically generated by systemd-sysv-generator

    [Unit]
    SourcePath=/etc/init.d/networking
    Description=LSB: Raise network interfaces.
    DefaultDependencies=no
    Before=sysinit.target shutdown.target
    After=mountkernfs.service local-fs.target urandom.service
    Conflicts=shutdown.target

    [Service]
    Type=forking
    Restart=no
    TimeoutSec=0
    IgnoreSIGPIPE=no
    KillMode=process
    GuessMainPID=no
    RemainAfterExit=yes
    SysVStartPriority=13
    ExecStart=/etc/init.d/networking start
    ExecStop=/etc/init.d/networking stop
    ExecReload=/etc/init.d/networking reload

    # /run/systemd/generator/networking.service.d/50-insserv.conf-$network.conf
    # Automatically generated by systemd-insserv-generator

    [Unit]
    Wants=network.target
    Before=network.target

    # /lib/systemd/system/networking.service.d/network-pre.conf
    [Unit]
    After=network-pre.target systemd-sysctl.service systemd-modules-load.service

On Fri, Jul 14, 2017 at 12:08:41PM +0200, Benjamin Drung wrote:
> I tested your branch on a Debian 8 (jessie) system with a mlx4 card and
> found several points to fix/improve/discuss:

Thanks!

> 1. The rdma-load-modules@ service needs to set DefaultDependencies=no.
> Otherwise it will end up in a dependency loop:

Makes sense

> 2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
> instead of listing each .conf file individually.

The srp_daemon.conf is in the srp package, so I don't think I can do
that?

> 3. Default conf files for opa.conf and roce.conf are missing.

Right.. Not sure if we should split them like this or not, depends if
we can detect the card type at runtime.

> 4. Should rdma-load-modules@ *not* fail if the corresponding .conf is
> missing?

Does it fail now? Failing seems like the right thing to do for a
missing conf file.

> 5. How to handle built-in modules correctly? Our kernel has the i40e
> module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
> be started, but the system does not have an i40e card and thus I don't
> want to have the module started.

Fixing this would require more fancy udev wonkery - I copied RH's
tested approach which triggers on driver presence, not on driver
binding.

My udev is not great, but something like this:

    DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"

Might work better? I think that triggers on driver bind? Could you try
to switch your mlx and i40e drivers in that way?

> 6. The ipoib module (loaded by rdma-load-modules@infiniband) needs to
> be loaded before the networking.service is running. The networking.service
> brings up the network devices on Debian. It runs "ifup -a" which reads

Hum. That LSB networking.service sure is an ugly hack, it doesn't
support hotplug so it has this:

    After=network-pre.target systemd-sysctl.service systemd-modules-load.service

To 'try' and run after some amount of hot plugging is done. IMHO this
is done wrong, it should start after sysinit.target but before
network-online.target or something...

The only solution to this kind of problem is to add more ordering,
Debian should include a patch to rdma-load-modules@ to put it before
their unique networking.service..

Jason

On Friday, 14.07.2017, at 09:55 -0600, Jason Gunthorpe wrote:
> > 2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
> > instead of listing each .conf file individually.
>
> The srp_daemon.conf is in the srp package, so I don't think I can do
> that?

Yes. You are right. So it needs to be listed individually.

> > 4. Should rdma-load-modules@ *not* fail if the corresponding .conf
> > is missing?
>
> Does it fail now? Failing seems like the right thing to do for a
> missing conf file.

Yes. Currently it fails in this situation.

> > 5. How to handle built-in modules correctly? Our kernel has the
> > i40e module built in (CONFIG_I40E=y) and
> > rdma-load-modules@i40e.service will be started, but the system does
> > not have an i40e card and thus I don't want to have the module
> > started.
>
> Fixing this would require more fancy udev wonkery - I copied RH's
> tested approach which triggers on driver presence, not on driver
> binding.
>
> My udev is not great, but something like this:
>
>     DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
>
> Might work better? I think that triggers on driver bind? Could you
> try to switch your mlx and i40e drivers in that way?

I tried it. It works as expected. rdma-load-modules@mlx4 is loaded and
rdma-load-modules@i40e.service is not loaded. I have not tested whether
the udev trigger will also work for built-in modules that are in use.

> > 6. The ipoib module (loaded by rdma-load-modules@infiniband) needs
> > to be loaded before the networking.service is running. The
> > networking.service brings up the network devices on Debian. It runs
> > "ifup -a" which reads
>
> Hum. That LSB networking.service sure is an ugly hack, it doesn't
> support hotplug so it has this:
>
>     After=network-pre.target systemd-sysctl.service systemd-modules-load.service
>
> To 'try' and run after some amount of hot plugging is done. IMHO this
> is done wrong, it should start after sysinit.target but before
> network-online.target or something...
>
> The only solution to this kind of problem is to add more ordering,
> Debian should include a patch to rdma-load-modules@ to put it before
> their unique networking.service..

Or patch rdma-load-modules@ to put it before network-pre.target

On Fri, Jul 14, 2017 at 06:23:13PM +0200, Benjamin Drung wrote:
> > My udev is not great, but something like this:
> >
> >     DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> >
> > Might work better? I think that triggers on driver bind? Could you
> > try to switch your mlx and i40e drivers in that way?
>
> I tried it. It works as expected. rdma-load-modules@mlx4 is loaded and
> rdma-load-modules@i40e.service is not loaded. Not tested if the udev
> trigger will also work for used built-in modules.

Okay, I will change it to be like that and we can see what people
think.

> > The only solution to this kind of problem is to add more ordering,
> > Debian should include a patch to rdma-load-modules@ to put it before
> > their unique networking.service..
>
> Or patch rdma-load-modules@ to put it before network-pre.target

I think that would be OK.

Jason

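The ordering agreed on above could be expressed as a systemd drop-in for the template unit. This is only a sketch under assumptions: the drop-in path and file name are hypothetical, and the thread does not show the change that was eventually made to rdma-load-modules@.service. The `Wants=` plus `Before=` pairing follows the usual systemd convention for network-pre.target, which is a passive target that must be pulled in by the unit that wants to run ahead of it.

```
# /etc/systemd/system/rdma-load-modules@.service.d/network-pre.conf  (hypothetical path)
[Unit]
# Pull network-pre.target into the transaction and order every
# rdma-load-modules@ instance before it, so that services declaring
# After=network-pre.target start only after the modules are loaded.
Wants=network-pre.target
Before=network-pre.target
```

Ordering directives only take effect between units that are queued in the same transaction, so this does not by itself cover instances that udev starts later.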
On Friday, 14.07.2017, at 10:40 -0600, Jason Gunthorpe wrote:
> On Fri, Jul 14, 2017 at 06:23:13PM +0200, Benjamin Drung wrote:
> > > My udev is not great, but something like this:
> > >
> > >     DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> > >
> > > Might work better? I think that triggers on driver bind? Could
> > > you try to switch your mlx and i40e drivers in that way?
> >
> > I tried it. It works as expected. rdma-load-modules@mlx4 is loaded
> > and rdma-load-modules@i40e.service is not loaded. Not tested if the
> > udev trigger will also work for used built-in modules.
>
> Okay, I will change it to be like that and we can see what people
> think.

Any updates?

On Thu, Jul 20, 2017 at 11:04:10AM +0200, Benjamin Drung wrote:
> On Friday, 14.07.2017, at 10:40 -0600, Jason Gunthorpe wrote:
> > On Fri, Jul 14, 2017 at 06:23:13PM +0200, Benjamin Drung wrote:
> > > > My udev is not great, but something like this:
> > > >
> > > >     DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> > > >
> > > > Might work better? I think that triggers on driver bind? Could
> > > > you try to switch your mlx and i40e drivers in that way?
> > >
> > > I tried it. It works as expected. rdma-load-modules@mlx4 is loaded
> > > and rdma-load-modules@i40e.service is not loaded. Not tested if the
> > > udev trigger will also work for used built-in modules.
> >
> > Okay, I will change it to be like that and we can see what people
> > think.
>
> Any updates?

I've been doing some more testing here and it does not work entirely
properly yet.

The change to 'DRIVER==' causes things to fail, the kernel only
generates uevents when kobjects are created, and binding a driver to a
PCI device does not create a kobject. This means these lines:

    DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"

Do not work reliably. Depending on the module like before would still
seem to be the best option.

Also, the ordering before sysinit.target does not work reliably, udev
loads modules asynchronously and systemd does not block on udev unless
systemd-udev-settle is depended on.

I'm not sure how Debian's legacy networking.service manages to work at
all, perhaps it is broken and it is just unlikely to be hit because of
how quickly most ethernet drivers get loaded by udev. As far as I can
tell, it needs to depend on systemd-udev-settle to guarantee that all
the cold-plug network drivers have been loaded by udev before
starting..

.. all of these comments would seem to apply equally to the
rdma.service approach, so I think this is still an improvement.

Jason

Hi,

I have a Debian 9 (stretch) system with a backported rdma-core 17.0-1
package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel
modules) and the following network configuration in
/etc/network/interfaces:

```
auto ib0.dddd
iface ib0.dddd inet6 static
    address fd44:1:5255::
    netmask 64
    pre-up echo connected > /sys/class/net/$IFACE/mode
    dad-attempts 600

auto ib1.dddd
iface ib1.dddd inet6 static
    address fd44:2:5255::
    netmask 64
    pre-up echo connected > /sys/class/net/$IFACE/mode
    dad-attempts 600
```

The terminal shows the following ordering:

```
[FAILED] Failed to start Raise network interfaces.
[  OK  ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf
[  OK  ] Started Load RDMA modules from /etc/rdma/modules/infiniband.conf
[  OK  ] Reached target RDMA Hardware.
```

The networking.service fails with:

```
$ journalctl --no-host -u networking.service
[...]
Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib0.dddd/mode: Directory nonexistent
Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd
Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib1.dddd/mode: Directory nonexistent
Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd
Mai 15 13:16:40 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Mai 15 13:16:40 systemd[1]: Failed to start Raise network interfaces.
Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed state.
Mai 15 13:16:40 systemd[1]: networking.service: Failed with result 'exit-code'.
```

The networking.service fails because it tries to bring up
ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service loads
the ib_ipoib kernel module. networking.service declares that it should
run after the network-pre.target and
rdma-load-modules@infiniband.service declares to run before
network-pre.target. Therefore the order should be
rdma-load-modules@infiniband.service -> network-pre.target ->
networking.service, but this is obviously not the case.

I am writing to this mailing list because I got stuck debugging this
issue and need your help.

On Tue, May 15, 2018 at 04:47:22PM +0200, Benjamin Drung wrote:
> Hi,
>
> I have a Debian 9 (stretch) system with a backported rdma-core 17.0-1
> package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel
> modules) and the following network configuration in
> /etc/network/interfaces:
>
> ```
> auto ib0.dddd
> iface ib0.dddd inet6 static
>     address fd44:1:5255::
>     netmask 64
>     pre-up echo connected > /sys/class/net/$IFACE/mode
>     dad-attempts 600
>
> auto ib1.dddd
> iface ib1.dddd inet6 static
>     address fd44:2:5255::
>     netmask 64
>     pre-up echo connected > /sys/class/net/$IFACE/mode
>     dad-attempts 600
> ```
>
> The terminal shows the following ordering:
>
> ```
> [FAILED] Failed to start Raise network interfaces.
> [  OK  ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf
> [  OK  ] Started Load RDMA modules from /etc/rdma/modules/infiniband.conf
> [  OK  ] Reached target RDMA Hardware.
> ```
>
> The networking.service fails with:
> ```
> $ journalctl --no-host -u networking.service
> [...]
> Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib0.dddd/mode: Directory nonexistent
> Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd
> Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib1.dddd/mode: Directory nonexistent
> Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd
> Mai 15 13:16:40 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
> Mai 15 13:16:40 systemd[1]: Failed to start Raise network interfaces.
> Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed state.
> Mai 15 13:16:40 systemd[1]: networking.service: Failed with result 'exit-code'.
> ```
>
> The networking.service fails because it tries to bring up
> ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service loads
> the ib_ipoib kernel module. networking.service declares that it should
> run after the network-pre.target and
> rdma-load-modules@infiniband.service declares to run before
> network-pre.target. Therefore the order should be
> rdma-load-modules@infiniband.service -> network-pre.target ->
> networking.service, but this is obviously not the case.
>
> I am writing to this mailing list because I got stuck debugging this
> issue and need your help.

The udev.md explains this:

    ## Interaction with legacy non-hotplug services

    Services that cannot handle hot plug must be ordered after
    systemd-udev-settle.service, which will wait for udev to complete loading
    modules and scheduling systemd services. This ensures that all RDMA hardware
    present at boot is setup before proceeding to run the legacy service.

    Admins using legacy services can also place their RDMA hardware modules
    (e.g. mlx4_ib) directly in /etc/modules-load.d/ or in their initrd which will
    cause systemd to defer passing to sysinit.target until all RDMA hardware is
    setup, this is usually sufficient for legacy services. This is probably the
    default behavior in many configurations.

Since you see the backwards ordering and the errors it means that
ifupdown in stretch does not support hotplug. IMHO it is a bug in that
package that it doesn't order after settle to try and avoid boot time
hot plug events that it cannot handle.

The modules solution is simplest, add ipoib and HCA drivers to
modules.conf

The robust and future looking solution is to use systemd-networkd
instead of legacy ifupdown... It is a bit annoying today to get the
connected setting though.

Jason

Am Dienstag, den 15.05.2018, 08:58 -0600 schrieb Jason Gunthorpe: > On Tue, May 15, 2018 at 04:47:22PM +0200, Benjamin Drung wrote: > > Hi, > > > > I have a Debian 9 (stretch) system with a backported rdma-core > > 17.0-1 > > package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel > > modules) and following network configuration in > > /etc/network/interfaces: > > > > ``` > > auto ib0.dddd > > iface ib0.dddd inet6 static > > address fd44:1:5255:: > > netmask 64 > > pre-up echo connected > /sys/class/net/$IFACE/mode > > dad-attempts 600 > > > > auto ib1.dddd > > iface ib1.dddd inet6 static > > address fd44:2:5255:: > > netmask 64 > > pre-up echo connected > /sys/class/net/$IFACE/mode > > dad-attempts 600 > > ``` > > > > The terminal shows following ordering: > > > > ``` > > [FAILED] Failed to start Raise network interfaces. > > [ OK ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf > > [ OK ] Started Load RDMA modules from > > /etc/rdma/modules/infiniband.conf > > [ OK ] Reached target RDMA Hardware. > > ``` > > > > the networking.service fails with: > > ``` > > $ journalctl --no-host -u networking.service > > [...] > > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create > > /sys/class/net/ib0.dddd/mode: Directory nonexistent > > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd > > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create > > /sys/class/net/ib1.dddd/mode: Directory nonexistent > > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd > > Mai 15 13:16:40 systemd[1]: networking.service: Main process > > exited, code=exited, status=1/FAILURE > > Mai 15 13:16:40 systemd[1]: Failed to start Raise network > > interfaces. > > Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed > > state. > > Mai 15 13:16:40 systemd[1]: networking.service: Failed with result > > 'exit-code'. 
> > > > ``` > > > > The networking.service fails because it tries to bring up > > ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service > > loads > > the ib_ipoib kernel module. networking.service declares that it > > should > > run after the network-pre.target and rdma-load-modules@infiniband.s > > ervi > > ce declares to run before network-pre.target. Therefore the order > > should be rdma-load-modules@infiniband.service -> network- > > pre.target -> > > networking.service, but this is obviously not the case. > > > > I am writing to this mailing list, because got stuck with debugging > > this issue and need your help. > > The udev.md explains this: > > ## Interaction with legacy non-hotplug services > > Services that cannot handle hot plug must be ordered after > systemd-udev-settle.service, which will wait for udev to complete > loading > modules and scheduling systemd services. This ensures that all RDMA > hardware > present at boot is setup before proceeding to run the legacy > service. > > Admins using legacy services can also place their RDMA hardware > modules > (e.g. mlx4_ib) directly in /etc/modules-load.d/ or in their initrd > which will > cause systemd to defer passing to sysinit.target until all RDMA > hardware is > setup, this is usually sufficient for legacy services. This is > probably the > default behavior in many configurations. > > Since you see the backwards ordering and the errors it meands that > ifupdown in stretch does not support hotplug. IMHO it is a bug in > that > package that it doesn't order after settle to try and avoid boot time > hot plug events that it cannot handle. 
>
> The modules solution is simplest, add ipoib and HCA drivers to
> modules.conf

I added the systemd-udev-settle.service dependency:

```
$ systemctl cat networking.service
# /lib/systemd/system/networking.service
[Unit]
Description=Raise network interfaces
Documentation=man:interfaces(5)
DefaultDependencies=no
Wants=network.target
After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
Before=network.target shutdown.target network-online.target
Conflicts=shutdown.target

[Install]
WantedBy=multi-user.target
WantedBy=network-online.target

[Service]
Type=oneshot
EnvironmentFile=-/etc/default/networking
ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
ExecStart=/sbin/ifup -a --read-environment
ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
RemainAfterExit=true
TimeoutStartSec=5min

# /etc/systemd/system/networking.service.d/rdma.conf
[Unit]
# See https://marc.info/?l=linux-rdma&m=152639629213650&w=2
After=systemd-udev-settle.service
```

but it is still not working (same error messages).
On Tue, May 15, 2018 at 06:10:54PM +0200, Benjamin Drung wrote:
> I added the systemd-udev-settle.service dependency:
>
> ```
> $ systemctl cat networking.service
> # /lib/systemd/system/networking.service
> [Unit]
> Description=Raise network interfaces
> Documentation=man:interfaces(5)
> DefaultDependencies=no
> Wants=network.target
> After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> Before=network.target shutdown.target network-online.target
> Conflicts=shutdown.target
>
> [Install]
> WantedBy=multi-user.target
> WantedBy=network-online.target
>
> [Service]
> Type=oneshot
> EnvironmentFile=-/etc/default/networking
> ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
> ExecStart=/sbin/ifup -a --read-environment
> ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
> RemainAfterExit=true
> TimeoutStartSec=5min
>
> # /etc/systemd/system/networking.service.d/rdma.conf
> [Unit]
> # See https://marc.info/?l=linux-rdma&m=152639629213650&w=2
> After=systemd-udev-settle.service
> ```
>
> but it is still not working (same error messages).

I think it needs a Wants as well?

Jason
On Tue, 2018-05-15 at 18:10 +0200, Benjamin Drung wrote: > Am Dienstag, den 15.05.2018, 08:58 -0600 schrieb Jason Gunthorpe: > > On Tue, May 15, 2018 at 04:47:22PM +0200, Benjamin Drung wrote: > > > Hi, > > > > > > I have a Debian 9 (stretch) system with a backported rdma-core > > > 17.0-1 > > > package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel > > > modules) and following network configuration in > > > /etc/network/interfaces: > > > > > > ``` > > > auto ib0.dddd > > > iface ib0.dddd inet6 static > > > address fd44:1:5255:: > > > netmask 64 > > > pre-up echo connected > /sys/class/net/$IFACE/mode > > > dad-attempts 600 > > > > > > auto ib1.dddd > > > iface ib1.dddd inet6 static > > > address fd44:2:5255:: > > > netmask 64 > > > pre-up echo connected > /sys/class/net/$IFACE/mode > > > dad-attempts 600 > > > ``` > > > > > > The terminal shows following ordering: > > > > > > ``` > > > [FAILED] Failed to start Raise network interfaces. > > > [ OK ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf > > > [ OK ] Started Load RDMA modules from > > > /etc/rdma/modules/infiniband.conf > > > [ OK ] Reached target RDMA Hardware. > > > ``` > > > > > > the networking.service fails with: > > > ``` > > > $ journalctl --no-host -u networking.service > > > [...] > > > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create > > > /sys/class/net/ib0.dddd/mode: Directory nonexistent > > > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd > > > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create > > > /sys/class/net/ib1.dddd/mode: Directory nonexistent > > > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd > > > Mai 15 13:16:40 systemd[1]: networking.service: Main process > > > exited, code=exited, status=1/FAILURE > > > Mai 15 13:16:40 systemd[1]: Failed to start Raise network > > > interfaces. > > > Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed > > > state. 
> > > Mai 15 13:16:40 systemd[1]: networking.service: Failed with result > > > 'exit-code'. > > > > > > ``` > > > > > > The networking.service fails because it tries to bring up > > > ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service > > > loads > > > the ib_ipoib kernel module. networking.service declares that it > > > should > > > run after the network-pre.target and rdma-load-modules@infiniband.s > > > ervi > > > ce declares to run before network-pre.target. Therefore the order > > > should be rdma-load-modules@infiniband.service -> network- > > > pre.target -> > > > networking.service, but this is obviously not the case. > > > > > > I am writing to this mailing list, because got stuck with debugging > > > this issue and need your help. > > > > The udev.md explains this: > > > > ## Interaction with legacy non-hotplug services > > > > Services that cannot handle hot plug must be ordered after > > systemd-udev-settle.service, which will wait for udev to complete > > loading > > modules and scheduling systemd services. This ensures that all RDMA > > hardware > > present at boot is setup before proceeding to run the legacy > > service. > > > > Admins using legacy services can also place their RDMA hardware > > modules > > (e.g. mlx4_ib) directly in /etc/modules-load.d/ or in their initrd > > which will > > cause systemd to defer passing to sysinit.target until all RDMA > > hardware is > > setup, this is usually sufficient for legacy services. This is > > probably the > > default behavior in many configurations. > > > > Since you see the backwards ordering and the errors it meands that > > ifupdown in stretch does not support hotplug. IMHO it is a bug in > > that > > package that it doesn't order after settle to try and avoid boot time > > hot plug events that it cannot handle. 
> >
> > The modules solution is simplest, add ipoib and HCA drivers to
> > modules.conf
>
> I added the systemd-udev-settle.service dependency:
>
> ```
> $ systemctl cat networking.service
> # /lib/systemd/system/networking.service
> [Unit]
> Description=Raise network interfaces
> Documentation=man:interfaces(5)
> DefaultDependencies=no
> Wants=network.target
> After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> Before=network.target shutdown.target network-online.target
> Conflicts=shutdown.target
>
> [Install]
> WantedBy=multi-user.target
> WantedBy=network-online.target
>
> [Service]
> Type=oneshot
> EnvironmentFile=-/etc/default/networking
> ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'

I wouldn't trust that you can run udevadm settle here and get the right
results. This will only wait for the current udev hotplug events to
complete. It won't necessarily wait for hotplug events that have not
started yet. I think you need to change the After line above to include
systemd-udev-settle.service directly.

> ExecStart=/sbin/ifup -a --read-environment
> ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
> RemainAfterExit=true
> TimeoutStartSec=5min
>
> # /etc/systemd/system/networking.service.d/rdma.conf
> [Unit]
> # See https://marc.info/?l=linux-rdma&m=152639629213650&w=2
> After=systemd-udev-settle.service
> ```
>
> but it is still not working (same error messages).
>
On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > I added the systemd-udev-settle.service dependency:
> >
> > ```
> > $ systemctl cat networking.service
> > # /lib/systemd/system/networking.service
> > [Unit]
> > Description=Raise network interfaces
> > Documentation=man:interfaces(5)
> > DefaultDependencies=no
> > Wants=network.target
> > After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> > Before=network.target shutdown.target network-online.target
> > Conflicts=shutdown.target
> >
> > [Install]
> > WantedBy=multi-user.target
> > WantedBy=network-online.target
> >
> > [Service]
> > Type=oneshot
> > EnvironmentFile=-/etc/default/networking
> > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
>
> I wouldn't trust that you can run udevadm settle here and get the right
> results. This will only wait for the current udev hotplug events to
> complete.

Oh, neat, so udev settle is already called by Debian's
networking.service (as it should be) - assuming CONFIGURE_INTERFACES
is set, and whatever that other stuff does (Ben is this triggering for
you?)

If this is already happening then this probably means you have it right
and the udev hotplug cycle with RDMA is even too async for 'udev settle'??

Is it because we launch the module loads from systemd? Presumably if it
was internal it wouldn't break the hotplug cycle?

If this is the case we might have to replace the systemd based loader
with some kind of udev builtin loader :\

> It won't necessarily wait for any unstarted hotplug events.
> I think you need to change the After line above to include the
> systemd-udev-settle.service directly.

It is OK to have multiple Afters, systemd adds to the list in this
case. This is how the add-ins are supposed to work..

But the whole thing isn't needed if the 'built-in' settle of
networking.service can be used.
Jason
On Tue, 2018-05-15 at 13:20 -0600, Jason Gunthorpe wrote:
> On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > > I added the systemd-udev-settle.service dependency:
> > >
> > > ```
> > > $ systemctl cat networking.service
> > > # /lib/systemd/system/networking.service
> > > [Unit]
> > > Description=Raise network interfaces
> > > Documentation=man:interfaces(5)
> > > DefaultDependencies=no
> > > Wants=network.target
> > > After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> > > Before=network.target shutdown.target network-online.target
> > > Conflicts=shutdown.target
> > >
> > > [Install]
> > > WantedBy=multi-user.target
> > > WantedBy=network-online.target
> > >
> > > [Service]
> > > Type=oneshot
> > > EnvironmentFile=-/etc/default/networking
> > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
> >
> > I wouldn't trust that you can run udevadm settle here and get the right
> > results. This will only wait for the current udev hotplug events to
> > complete.
>
> Oh, neat, so udev settle is already called by Debian's
> networking.service (as it should be) - assuming CONFIGURE_INTERFACES
> is set, and whatever that other stuff does (Ben is this triggering
> for you?)

I should have looked more closely at the service file (I didn't notice
the udevadm settle in there). CONFIGURE_INTERFACES is not set in
/etc/default/networking and ifquery returns a bunch of interfaces.
Therefore 'udevadm settle' is executed.

I tried to debug it further by injecting commands into the pre-up hook.
When pre-up runs:

* lsmod shows that ib_ipoib is loaded
* 'ls -l /sys/class/net/' shows that neither ib0 nor ib1 is present

To me it looks like a race condition between populating
/sys/class/net/ibX after loading ib_ipoib and the networking service.

Do you have a suggestion how to address this?
We are using Mellanox OFED on the affected hosts. The mainline ipoib is
not affected. Are there commits related to this that we should
cherry-pick?
On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote: > Am Dienstag, den 15.05.2018, 13:20 -0600 schrieb Jason Gunthorpe: > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote: > > > > I added the systemd-udev-settle.service dependency: > > > > > > > > ``` > > > > $ systemctl cat networking.service > > > > # /lib/systemd/system/networking.service > > > > [Unit] > > > > Description=Raise network interfaces > > > > Documentation=man:interfaces(5) > > > > DefaultDependencies=no > > > > Wants=network.target > > > > After=local-fs.target network-pre.target apparmor.service > > > > systemd-sysctl.service systemd-modules-load.service > > > > Before=network.target shutdown.target network-online.target > > > > Conflicts=shutdown.target > > > > > > > > [Install] > > > > WantedBy=multi-user.target > > > > WantedBy=network-online.target > > > > > > > > [Service] > > > > Type=oneshot > > > > EnvironmentFile=-/etc/default/networking > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && > > > > udevadm settle' > > > > > > I wouldn't trust that you can run udevadm settle here and get the > > > right > > > results. This will only wait for the current udev hotplug events > > > to > > > complete. > > > > Oh, neat, so udev settle is already called by Debian's > > networking.service (as it should be) - assuming CONFIGURE_INTERFACES > > is set, and whatever that other stuff does (Ben is this triggering > > for you?) > > I should have looked more closely at the service file (I didn't notice > the udevadm settle in there). CONFIGURE_INTERFACES is not set in > /etc/default/networking and ifquery returns a bunch of interfaces. > Therefore 'udevadm settle' is executed. > > I tried to debug it further by injecting commands to the pre-up hook. 
> When pre-up runs:
>
> * lsmod shows that ib_ipoib is loaded
> * 'ls -l /sys/class/net/' shows that neither ib0 nor ib1 is present
>
> To me it looks like a race condition between populating
> /sys/class/net/ibX after loading ib_ipoib and the networking
> service.

Is the rdma device present at this point? e.g. /sys/class/infiniband?

Are any systemd-modules-load processes still running?

Are the mlx IB modules loaded?

> Do you have a suggestion how to address this? We are using Mellanox
> OFED on the affected hosts. The mainline ipoib is not affected. Are there
> commits related to this that we should cherry-pick?

Oh, mainline ipoib works? Great.

I have no idea what is in Mellanox OFED, sorry.. I don't think this is
anything that was fixed in mainline.

Jason
Am Donnerstag, den 17.05.2018, 11:16 -0600 schrieb Jason Gunthorpe: > On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote: > > Am Dienstag, den 15.05.2018, 13:20 -0600 schrieb Jason Gunthorpe: > > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote: > > > > > I added the systemd-udev-settle.service dependency: > > > > > > > > > > ``` > > > > > $ systemctl cat networking.service > > > > > # /lib/systemd/system/networking.service > > > > > [Unit] > > > > > Description=Raise network interfaces > > > > > Documentation=man:interfaces(5) > > > > > DefaultDependencies=no > > > > > Wants=network.target > > > > > After=local-fs.target network-pre.target apparmor.service > > > > > systemd-sysctl.service systemd-modules-load.service > > > > > Before=network.target shutdown.target network-online.target > > > > > Conflicts=shutdown.target > > > > > > > > > > [Install] > > > > > WantedBy=multi-user.target > > > > > WantedBy=network-online.target > > > > > > > > > > [Service] > > > > > Type=oneshot > > > > > EnvironmentFile=-/etc/default/networking > > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] > > > > > && > > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && > > > > > udevadm settle' > > > > > > > > I wouldn't trust that you can run udevadm settle here and get > > > > the > > > > right > > > > results. This will only wait for the current udev hotplug > > > > events > > > > to > > > > complete. > > > > > > Oh, neat, so udev settle is already called by Debian's > > > networking.service (as it should be) - assuming > > > CONFIGURE_INTERFACES > > > is set, and whatever that other stuff does (Ben is this > > > triggering > > > for you?) > > > > I should have looked more closely at the service file (I didn't > > notice > > the udevadm settle in there). CONFIGURE_INTERFACES is not set in > > /etc/default/networking and ifquery returns a bunch of interfaces. > > Therefore 'udevadm settle' is executed. 
> >
> > I tried to debug it further by injecting commands into the pre-up hook.
> > When pre-up runs:
> >
> > * lsmod shows that ib_ipoib is loaded
> > * 'ls -l /sys/class/net/' shows that neither ib0 nor ib1 is present
> >
> > To me it looks like a race condition between populating
> > /sys/class/net/ibX after loading ib_ipoib and the networking
> > service.
>
> Is the rdma device present at this point? e.g. /sys/class/infiniband?

/sys/class/infiniband/mlx4_0 is present.

> Are any systemd-modules-load processes still running?

'/lib/systemd/systemd-modules-load /etc/rdma/modules/infiniband.conf'
is still running.

> Are the mlx IB modules loaded?

Yes: mlx4_ib, mlx4_core, and mlx_compat are loaded (according to lsmod).
The first two modules are already loaded in the initrd. Also ib_ipoib,
ib_uverbs, ib_sa, ib_mad, ib_core, ib_addr, ib_netlink are loaded.
On Fri, May 18, 2018 at 11:26:07AM +0200, Benjamin Drung wrote: > Am Donnerstag, den 17.05.2018, 11:16 -0600 schrieb Jason Gunthorpe: > > On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote: > > > Am Dienstag, den 15.05.2018, 13:20 -0600 schrieb Jason Gunthorpe: > > > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote: > > > > > > I added the systemd-udev-settle.service dependency: > > > > > > > > > > > > ``` > > > > > > $ systemctl cat networking.service > > > > > > # /lib/systemd/system/networking.service > > > > > > [Unit] > > > > > > Description=Raise network interfaces > > > > > > Documentation=man:interfaces(5) > > > > > > DefaultDependencies=no > > > > > > Wants=network.target > > > > > > After=local-fs.target network-pre.target apparmor.service > > > > > > systemd-sysctl.service systemd-modules-load.service > > > > > > Before=network.target shutdown.target network-online.target > > > > > > Conflicts=shutdown.target > > > > > > > > > > > > [Install] > > > > > > WantedBy=multi-user.target > > > > > > WantedBy=network-online.target > > > > > > > > > > > > [Service] > > > > > > Type=oneshot > > > > > > EnvironmentFile=-/etc/default/networking > > > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] > > > > > > && > > > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && > > > > > > udevadm settle' > > > > > > > > > > I wouldn't trust that you can run udevadm settle here and get > > > > > the > > > > > right > > > > > results. This will only wait for the current udev hotplug > > > > > events > > > > > to > > > > > complete. > > > > > > > > Oh, neat, so udev settle is already called by Debian's > > > > networking.service (as it should be) - assuming > > > > CONFIGURE_INTERFACES > > > > is set, and whatever that other stuff does (Ben is this > > > > triggering > > > > for you?) > > > > > > I should have looked more closely at the service file (I didn't > > > notice > > > the udevadm settle in there). 
CONFIGURE_INTERFACES is not set in
> > > /etc/default/networking and ifquery returns a bunch of interfaces.
> > > Therefore 'udevadm settle' is executed.
> > >
> > > I tried to debug it further by injecting commands into the pre-up hook.
> > > When pre-up runs:
> > >
> > > * lsmod shows that ib_ipoib is loaded
> > > * 'ls -l /sys/class/net/' shows that neither ib0 nor ib1 is present
> > >
> > > To me it looks like a race condition between populating
> > > /sys/class/net/ibX after loading ib_ipoib and the networking
> > > service.
> >
> > Is the rdma device present at this point? e.g. /sys/class/infiniband?
>
> /sys/class/infiniband/mlx4_0 is present.
>
> > Are any systemd-modules-load processes still running?
>
> '/lib/systemd/systemd-modules-load /etc/rdma/modules/infiniband.conf'
> is still running.
>
> > Are the mlx IB modules loaded?
>
> Yes: mlx4_ib, mlx4_core, and mlx_compat are loaded (according to
> lsmod). The first two modules are already loaded in the initrd. Also
> ib_ipoib, ib_uverbs, ib_sa, ib_mad, ib_core, ib_addr, ib_netlink are
> loaded.

Hmm, that is very mysterious, then, I can't think how
systemd-modules-load could still be running at this point.

If you load the ib driver in initrd then the above should have been
scheduled very early in boot, and it has a Before=network-pre.target
which should delay networking.service from starting while it is
running.

What does the logging say about when rdma-load-modules was started and
was the IB device created before the initrd device exited?

Jason
Am Freitag, den 18.05.2018, 09:35 -0600 schrieb Jason Gunthorpe: > On Fri, May 18, 2018 at 11:26:07AM +0200, Benjamin Drung wrote: > > Am Donnerstag, den 17.05.2018, 11:16 -0600 schrieb Jason Gunthorpe: > > > On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote: > > > > Am Dienstag, den 15.05.2018, 13:20 -0600 schrieb Jason > > > > Gunthorpe: > > > > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote: > > > > > > > I added the systemd-udev-settle.service dependency: > > > > > > > > > > > > > > ``` > > > > > > > $ systemctl cat networking.service > > > > > > > # /lib/systemd/system/networking.service > > > > > > > [Unit] > > > > > > > Description=Raise network interfaces > > > > > > > Documentation=man:interfaces(5) > > > > > > > DefaultDependencies=no > > > > > > > Wants=network.target > > > > > > > After=local-fs.target network-pre.target apparmor.service > > > > > > > systemd-sysctl.service systemd-modules-load.service > > > > > > > Before=network.target shutdown.target network- > > > > > > > online.target > > > > > > > Conflicts=shutdown.target > > > > > > > > > > > > > > [Install] > > > > > > > WantedBy=multi-user.target > > > > > > > WantedBy=network-online.target > > > > > > > > > > > > > > [Service] > > > > > > > Type=oneshot > > > > > > > EnvironmentFile=-/etc/default/networking > > > > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != > > > > > > > "no" ] > > > > > > > && > > > > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" > > > > > > > ] && > > > > > > > udevadm settle' > > > > > > > > > > > > I wouldn't trust that you can run udevadm settle here and > > > > > > get > > > > > > the > > > > > > right > > > > > > results. This will only wait for the current udev hotplug > > > > > > events > > > > > > to > > > > > > complete. 
> > > > > > > > > > Oh, neat, so udev settle is already called by Debian's > > > > > networking.service (as it should be) - assuming > > > > > CONFIGURE_INTERFACES > > > > > is set, and whatever that other stuff does (Ben is this > > > > > triggering > > > > > for you?) > > > > > > > > I should have looked more closely at the service file (I didn't > > > > notice > > > > the udevadm settle in there). CONFIGURE_INTERFACES is not set > > > > in > > > > /etc/default/networking and ifquery returns a bunch of > > > > interfaces. > > > > Therefore 'udevadm settle' is executed. > > > > > > > > I tried to debug it further by injecting commands to the pre-up > > > > hook. > > > > When pre-up runs: > > > > > > > > * lsmod shows that ib_ipoib is loaded > > > > * 'ls -l /sys/class/net/' shows that neither ib0 and ib1 are > > > > present > > > > > > > > To me it looks like a race condition between populating > > > > /sys/class/net/ibX after loading ib_ipoib and the networking > > > > service. > > > > > > Is the rdma device present at this point? eg sys/class/infiniband > > > ? > > > > /sys/class/infiniband/mlx4_0 is present. > > > > > Is any systemd-modules-load processes still running? > > > > '/lib/systemd/systemd-modules-load > > /etc/rdma/modules/infiniband.conf' > > is still running. > > > Are the mlx IB modules loaded? > > > > Yes: mlx4_ib, mlx4_core, and mlx_compat are loaded (according to > > lsmod). The first two modules are already loaded in the initrd. > > Also > > ib_ipoib, ib_uverbs, ib_sa, ib_mad, ib_core, ib_addr, ib_netlink > > are > > loaded. > > Hmm, that is very mysterious, then, I can't think how systemd- > modules-load > could still be running at this point. > > If you load the ib driver in initrd then the above should have been > scheduled very early in boot, and it has a Before=network-pre.target > which should delay networking.service from starting while it is > running. 
>
> What does the logging say about when rdma-load-modules was started and
> was the IB device created before the initrd device exited?

I opened a bug report against systemd in Debian:
https://bugs.debian.org/899002

Then I tried to implement a workaround (which does not work):

$ cat /etc/systemd/system/networking.service.d/rdma.conf
[Service]
# Work around systemd bug https://bugs.debian.org/899002
# See also https://marc.info/?l=linux-rdma&m=152639629213650&w=2
ExecStartPre=/bin/ps auxff
ExecStartPre=/bin/ls -l /sys/class/infiniband
ExecStartPre=/bin/systemctl status rdma-load-modules@infiniband.service
ExecStartPre=/bin/sh -c 'while pid=$(pidof -s systemd-modules-load); do echo "Waiting for systemd-modules-load process $pid to exit..."; tail --pid=$pid -f /dev/null; done'

systemctl status says that rdma-load-modules@infiniband.service was
started one second after networking.service.

The ps command from ExecStartPre says that only systemd-journald,
systemd-udevd, multipathd, and init were running. "ls -l
/sys/class/infiniband" says that mlx4_0 is present. And "systemctl
status rdma-load-modules@infiniband.service" says:

rdma-load-modules@infiniband.service - Load RDMA modules from /etc/rdma/modules/infiniband.conf
   Loaded: loaded (/lib/systemd/system/rdma-load-modules@.service; static; vendor preset: enabled)
   Active: inactive (dead)
     Docs: file:/usr/share/doc/rdma-core/udev.md

So it is clear that rdma-load-modules@infiniband.service is not
triggered when networking.service is started.
On Fri, May 18, 2018 at 06:22:12PM +0200, Benjamin Drung wrote: > > Hmm, that is very mysterious, then, I can't think how systemd- > > modules-load > > could still be running at this point. > > > > If you load the ib driver in initrd then the above should have been > > scheduled very early in boot, and it has a Before=network-pre.target > > which should delay networking.service from starting while it is > > running. > > > > What does the logging say about when rdma-load-modules was started > > and > > was the IB device created before the initrd device exited? > > I opened a bug report against systemd in Debian: > https://bugs.debian.org/899002 > > Then I tried to implement a workaround (which does not work): > > $ cat /etc/systemd/system/networking.service.d/rdma.conf > [Service] > # Work around systemd bug https://bugs.debian.org/899002 > # See also https://marc.info/?l=linux-rdma&m=152639629213650&w=2 > ExecStartPre=/bin/ps auxff > ExecStartPre=/bin/ls -l /sys/class/infiniband > ExecStartPre=/bin/systemctl status rdma-load-modules@infiniband.service > ExecStartPre=/bin/sh -c 'while pid=$(pidof -s systemd-modules-load); do echo "Waiting for systemd-modules-load process $pid to exit..."; tail --pid=$pid -f /dev/null; done' > > systemctl status says that rdma-load-modules@infiniband.service was > started one second after networking.service. > > The ps command from ExecStartPre says that only systemd-journald, > systemd-udevd, multipathd, and init were running. "ls -l > /sys/class/infiniband" says that mlx4_0 is present. And "systemctl > status rdma-load-modules@infiniband.service" says: > > rdma-load-modules@infiniband.service - Load RDMA modules from /etc/rdma/modules/infiniband.conf > Loaded: loaded (/lib/systemd/system/rdma-load-modules@.service; static; vendor preset: enabled) > Active: inactive (dead) > Docs: file:/usr/share/doc/rdma-core/udev.md > > So it is clear, that rdma-load-modules@infiniband.service is not > triggered when networking.service is started. 
Hum, if you have the modules in the initrd then udev should schedule
this service to run essentially immediately on boot, and it should
become ordered properly..

I.e. the rdma device should already be present when udev is started.

Starting *after* networking.service suggests that the mlx4 RDMA device
was hotplugged into the system a long time after early boot! Which is
not at all what I expect.

What does dmesg say about the mlx4 driver load?

Upstream blocks module completion until the driver is done (this takes
a long time), is it possible that MOFED does this async? That could
explain everything.

Also, IMHO, the networking.service above is wrong. It should not
attempt to do udevadm settle internally, but it must depend on
systemd-udev-settle.service.

The reason is due to how systemd schedules ordering. Once it starts
running networking.service 'ExecStartPre' it will not re-consider
order past that point. So any activations done by udev while settling
have no impact on networking.service at all.

Having it depend on systemd-udev-settle.service means it gets to
recheck ordering after settle is done, but before starting
networking.service - which is the behavior it is really trying to get.

That may be a big part of this bug, go back to doing:

After=systemd-udev-settle.service
Requires=systemd-udev-settle.service

Jason
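The difference between a unit that is only ordered After=systemd-udev-settle.service and one that also pulls it in with Requires= or Wants= can be checked mechanically. What follows is a rough sketch, not something shipped by rdma-core or ifupdown: `check_settle_deps` is a hypothetical helper name, and it assumes the merged unit text (for example the output of `systemctl cat networking.service`) arrives on stdin. It is a plain text match that ignores section headers, so treat the result as a hint rather than as systemd's own view of the dependency graph.

```sh
#!/bin/sh
# Hypothetical helper: classify unit text read from stdin by how it
# relates to systemd-udev-settle.service.
#   ordered-and-pulled-in : After= plus Requires= or Wants=
#   ordered-only          : After= only (ordering without activation)
#   not-ordered           : no After= on the settle service
check_settle_deps() {
    awk '
        # Ordering directive that mentions the settle service.
        /^After=/ {
            if (index($0, "systemd-udev-settle.service")) after = 1
        }
        # Directives that actually pull the settle service into the boot.
        /^(Requires|Wants)=/ {
            if (index($0, "systemd-udev-settle.service")) pull = 1
        }
        END {
            if (after && pull) print "ordered-and-pulled-in"
            else if (after)    print "ordered-only"
            else               print "not-ordered"
        }
    '
}
```

On a live system this could be used as `systemctl cat networking.service | check_settle_deps`. A drop-in that only adds After= would be reported as `ordered-only`.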
Am Freitag, den 18.05.2018, 11:31 -0600 schrieb Jason Gunthorpe: > On Fri, May 18, 2018 at 06:22:12PM +0200, Benjamin Drung wrote: > > > Hmm, that is very mysterious, then, I can't think how systemd- > > > modules-load > > > could still be running at this point. > > > > > > If you load the ib driver in initrd then the above should have > > > been > > > scheduled very early in boot, and it has a Before=network- > > > pre.target > > > which should delay networking.service from starting while it is > > > running. > > > > > > What does the logging say about when rdma-load-modules was > > > started > > > and > > > was the IB device created before the initrd device exited? > > > > I opened a bug report against systemd in Debian: > > https://bugs.debian.org/899002 > > > > Then I tried to implement a workaround (which does not work): > > > > $ cat /etc/systemd/system/networking.service.d/rdma.conf > > [Service] > > # Work around systemd bug https://bugs.debian.org/899002 > > # See also https://marc.info/?l=linux-rdma&m=152639629213650&w=2 > > ExecStartPre=/bin/ps auxff > > ExecStartPre=/bin/ls -l /sys/class/infiniband > > ExecStartPre=/bin/systemctl status rdma-load-modules@infiniband.ser > > vice > > ExecStartPre=/bin/sh -c 'while pid=$(pidof -s systemd-modules- > > load); do echo "Waiting for systemd-modules-load process $pid to > > exit..."; tail --pid=$pid -f /dev/null; done' > > > > systemctl status says that rdma-load-modules@infiniband.service was > > started one second after networking.service. > > > > The ps command from ExecStartPre says that only systemd-journald, > > systemd-udevd, multipathd, and init were running. "ls -l > > /sys/class/infiniband" says that mlx4_0 is present. 
> > And "systemctl status rdma-load-modules@infiniband.service" says:
> >
> > rdma-load-modules@infiniband.service - Load RDMA modules from /etc/rdma/modules/infiniband.conf
> >    Loaded: loaded (/lib/systemd/system/rdma-load-modules@.service; static; vendor preset: enabled)
> >    Active: inactive (dead)
> >      Docs: file:/usr/share/doc/rdma-core/udev.md
> >
> > So it is clear that rdma-load-modules@infiniband.service is not
> > triggered when networking.service is started.
>
> Hum, if you have the modules in the initrd then udev should schedule
> this service to run essentially immediately on boot, and it should
> become ordered properly.
>
> I.e. the rdma device should already be present when udev is started.
>
> Starting *after* networking.service suggests that the mlx4 RDMA device
> was hotplugged into the system a long time after early boot! Which is
> not at all what I expect.
>
> What does dmesg say about the mlx4 driver load?

I booted with break=bottom and listed the loaded modules in the initrd.
They were:

mlx4_ib
ib_sa
ib_mad
ib_core
ib_addr
ib_netlink
mlx4_core
mlx_compat

> Upstream blocks module completion until the driver is done (this takes
> a long time), is it possible that MOFED does this async? That could
> explain everything.
>
> Also, IMHO, the networking.service above is wrong. It should not
> attempt to do udevadm settle internally, but it must depend on
> systemd-udev-settle.service.
>
> The reason is how systemd schedules ordering. Once it starts
> running networking.service 'ExecStartPre' it will not re-consider
> order past that point. So any activations done by udev while settling
> have no impact on networking.service at all.
>
> Having it depend on systemd-udev-settle.service means it gets to
> recheck ordering after settle is done, but before starting
> networking.service - which is the behavior it is really trying to get.
>
> That may be a big part of this bug, go back to doing:
>
> After=systemd-udev-settle.service
> Requires=systemd-udev-settle.service

You are right. I modified networking.service accordingly and it works as expected now. I sent a patch for ifupdown to Debian, but a discussion about the fix started: https://bugs.debian.org/899002
On Tue, May 22, 2018 at 03:23:47PM +0200, Benjamin Drung wrote:

> > Also, IMHO, the networking.service above is wrong. It should not
> > attempt to do udevadm settle internally, but it must depend on
> > systemd-udev-settle.service.

I'm still a little puzzled here, as ipoib should have been started really, really early on, early enough to get ordered.

Can you check the boot log dmesgs to see when the banner for plugging in the mlx4 IB device is printed? I fear that is being done after the initrd is finished with MOFED.

> > The reason is how systemd schedules ordering. Once it starts
> > running networking.service 'ExecStartPre' it will not re-consider
> > order past that point. So any activations done by udev while settling
> > have no impact on networking.service at all.
> >
> > Having it depend on systemd-udev-settle.service means it gets to
> > recheck ordering after settle is done, but before starting
> > networking.service - which is the behavior it is really trying to get.
> >
> > That may be a big part of this bug, go back to doing:
> >
> > After=systemd-udev-settle.service
> > Requires=systemd-udev-settle.service
>
> You are right. I modified networking.service accordingly and it works
> as expected now. I sent a patch for ifupdown to Debian, but a
> discussion about the fix started: https://bugs.debian.org/899002

That bug discussion has taken a strange turn..

Firstly, the entire point of 'udevadm settle' is to create compatibility with non-hotplug aware things like ifupdown by synchronizing the cold plug of all the boot devices.

So it is wrong to say that this is somehow the fault of the rdma scripts - they should work correctly with the settle mechanism.

I can understand ifupdown not wanting to always depend on settle, but there really is no choice. If their scripts don't support hot plug then they have to settle before running them or lots of stuff just won't work right.
I think the compromise for them should be to keep the settle as optional, but implement it correctly. Something like this helper service:

ifupdown-pre.service:

[Unit]
Description=Helper to synchronize boot up for ifupdown
Wants=network.target
After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
Before=network.target shutdown.target network-online.target
Conflicts=shutdown.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && systemctl start systemd-udev-settle.service'

They could compromise and put the optional settle they already have in their own script (like

Then change networking.service to have:

# Launch the oneshot to conditionally start systemd-udev-settle.service before starting this
Requires=ifupdown-pre.service
After=ifupdown-pre.service
# If it did launch systemd-udev-settle.service then wait for it before starting
After=systemd-udev-settle.service

And obviously drop the wrong ExecStartPre from networking.service.

Jason
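One detail of the conditional in the proposed ExecStart line may be worth checking before using it: a `test && command` list returns non-zero when the test is false, whereas an `if` statement with a false condition returns zero. The sketch below only demonstrates that shell behavior; `settle` is a hypothetical stand-in for "systemctl start systemd-udev-settle.service" so that it can be run anywhere, and it is not taken from the thread.

```shell
#!/bin/sh
# settle is a hypothetical stand-in for
# "systemctl start systemd-udev-settle.service" so the sketch runs anywhere.
settle() { echo "settle requested"; }

# Form used in the proposal: the whole list fails when the test is false.
and_form() {
    [ "$1" != "no" ] && settle
}

# Alternative form: an if statement whose condition is false returns 0.
if_form() {
    if [ "$1" != "no" ]; then settle; fi
}

# Capture the exit status of each form for the value "no".
and_rc=0; and_form no || and_rc=$?
if_rc=0;  if_form no  || if_rc=$?
echo "and_form: $and_rc, if_form: $if_rc"   # prints "and_form: 1, if_form: 0"
```

Since a oneshot unit whose ExecStart exits non-zero is treated as failed, and networking.service would carry Requires=ifupdown-pre.service, the first form could keep networking.service from starting when CONFIGURE_INTERFACES is "no". If that is not the intent, the `if` form or a `-` prefix on the ExecStart command might be considered.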
diff --git a/srp_daemon/srp_daemon.service.in b/srp_daemon/srp_daemon.service.in
index 33dddd5cb46fef..cca1fce9c99283 100644
--- a/srp_daemon/srp_daemon.service.in
+++ b/srp_daemon/srp_daemon.service.in
@@ -1,6 +1,6 @@
 [Unit]
 Description=Daemon that discovers and logs in to SRP target systems
-Documentation=man:srp_daemon file:/etc/rdma/rdma.conf file:/etc/srp_daemon.conf
+Documentation=man:srp_daemon file:/etc/srp_daemon.conf
 DefaultDependencies=false
 Conflicts=emergency.target emergency.service
 Before=remote-fs-pre.target
diff --git a/srp_daemon/srp_daemon_port@.service.in b/srp_daemon/srp_daemon_port@.service.in
index 0ec966f912aec8..3c9c824fd243aa 100644
--- a/srp_daemon/srp_daemon_port@.service.in
+++ b/srp_daemon/srp_daemon_port@.service.in
@@ -3,6 +3,7 @@ Description=SRP daemon that monitors port %i
 Documentation=man:srp_daemon file:/etc/rdma/rdma.conf file:/etc/srp_daemon.conf
 DefaultDependencies=false
 Conflicts=emergency.target emergency.service
+Requires=srp_kernel_module.service
 After=srp_daemon.service dev-infiniband-umad-%i.device network.target
 BindsTo=srp_daemon.service dev-infiniband-umad-%i.device
 Before=remote-fs-pre.target
diff --git a/srp_daemon/srp_kernel_module.service b/srp_daemon/srp_kernel_module.service
new file mode 100644
index 00000000000000..b779031578dae1
--- /dev/null
+++ b/srp_daemon/srp_kernel_module.service
@@ -0,0 +1,12 @@
+[Unit]
+Description=Load the SRP daemon kernel module
+Documentation=man:srp_daemon
+After=systemd-modules-load.service
+ConditionCapability=CAP_SYS_MODULE
+ConditionPathIsDirectory=!/sys/class/infiniband_srp/
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStartPre=bash -c 'mkdir -p /run/modules-load.d && echo ib_srp > /run/modules-load.d/rdma_core_srp_modules.conf'
+ExecStart=/lib/systemd/systemd-modules-load
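To make the behavior of the new srp_kernel_module.service easier to follow, the sketch below replays its steps in plain shell against a scratch directory, so it can be run unprivileged. `RUN_DIR` and `SYSFS_SRP` are hypothetical variables introduced here for illustration; the real unit writes to /run/modules-load.d and then invokes /lib/systemd/systemd-modules-load, which this sketch only announces rather than executes.

```shell
#!/bin/sh
# Illustrative replay of srp_kernel_module.service; not part of the patch.
# RUN_DIR stands in for /run and SYSFS_SRP for /sys/class/infiniband_srp/
# (both hypothetical variables) so nothing on the real system is touched.
RUN_DIR=$(mktemp -d)
SYSFS_SRP="$RUN_DIR/sys/class/infiniband_srp"

load_srp_module() {
    # Mirrors ConditionPathIsDirectory=!...: skip when the directory exists,
    # which presumably means ib_srp is already loaded.
    if [ -d "$SYSFS_SRP" ]; then
        echo "skipped"
        return 0
    fi
    # Mirrors the pre-start command: drop a modules-load.d snippet.
    mkdir -p "$RUN_DIR/modules-load.d"
    echo ib_srp > "$RUN_DIR/modules-load.d/rdma_core_srp_modules.conf"
    # The real unit would now run /lib/systemd/systemd-modules-load.
    echo "would run systemd-modules-load"
}

first=$(load_srp_module)
conf=$(cat "$RUN_DIR/modules-load.d/rdma_core_srp_modules.conf")
mkdir -p "$SYSFS_SRP"          # pretend the module is now loaded
second=$(load_srp_module)
echo "$first / $conf / $second"
rm -rf "$RUN_DIR"
```

The second call is skipped because the stand-in sysfs directory exists, which is the effect the negated ConditionPathIsDirectory= line appears intended to have once ib_srp is loaded.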