Fine folks of c/selfhosted, I've got a Docker LXC (Debian) running in Proxmox that loses its local network connection 24 hours after boot. It's remedied with an LXC restart. I am still able to access the console through Proxmox when this happens, but all running services (docker ps still says they're running) are inaccessible on the network. Any recommendations for an inexperienced selfhoster like myself to keep this thing up for more than 24 hours?
Tried:
- Pruning everything from Docker in case it was a remnant of an old container or something.
- Confirming network config on the router wasn't breaking anything.
- Checking there were no cron tasks doing funky things.
I did have a Watchtower container running on it recently, but have since removed it. The 24 hr pattern got me thinking Watchtower was the only thing that would really trigger an event 24 hrs after start, and the problem began at about the same time I removed it (intending to do manual updates because of Immich).
...and of course, any fix needs 24 hours to confirm it actually worked.
A forum post I found asked for the output of ip a and ip r. The notable difference: after disconnecting, ip r is missing the route to the gateway.
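For anyone wanting to run the same check from the Proxmox console, here is a rough sketch that tests whether the ip r output still contains a gateway route. The function name has_default_route is made up for illustration, and it assumes the gateway shows up as a line starting with "default via", which is how iproute2 normally prints it.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip r` output on stdin and succeeds (exit 0)
# only if it contains a default route through a gateway.
has_default_route() {
    grep -q '^default via '
}

# Example use inside the container (not run automatically here):
#   ip r | has_default_route && echo "gateway route present" || echo "gateway route missing"
```

Running it once right after boot and again after the drop should show the same difference described above.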
Update: started going through journalctl and found the below abnormal entries when it loses connection, now investigating to see if I can find out why...
Apr 16 14:09:16 docker 922abd47b5c5[376]: [msg] Nameserver 1.1.1.1:53 has failed: request timed out.
Apr 16 14:09:16 docker 922abd47b5c5[376]: [msg] Nameserver 192.168.1.5:53 has failed: request timed out.
Apr 16 14:09:16 docker 922abd47b5c5[376]: [msg] All nameservers have failed
Update 2: Using systemctl status networking.service, I found that networking.service was in a failed state (Active: failed (Result: exit-code)). I also compared it to a separate, stable Docker LXC, where networking.service was active, so I did some searching to remedy that.
x networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Tue 2024-04-16 17:17:41 CST; 8min ago
Docs: man:interfaces(5)
Process: 20892 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 21124 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 20892 (code=exited, status=1/FAILURE)
CPU: 297ms
Apr 16 17:17:34 docker dhclient[20901]: DHCPACK of 192.168.1.104 from 192.168.1.1
Apr 16 17:17:34 docker ifup[20901]: DHCPACK of 192.168.1.104 from 192.168.1.1
Apr 16 17:17:34 docker ifup[20910]: RTNETLINK answers: File exists
Apr 16 17:17:34 docker dhclient[20901]: bound to 192.168.1.104 -- renewal in 37359 seconds.
Apr 16 17:17:34 docker ifup[20901]: bound to 192.168.1.104 -- renewal in 37359 seconds.
Apr 16 17:17:41 docker ifup[20966]: Could not get a link-local address
Apr 16 17:17:41 docker ifup[20892]: ifup: failed to bring up eth0
Apr 16 17:17:41 docker systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 16 17:17:41 docker systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 16 17:17:41 docker systemd[1]: Failed to start networking.service - Raise network interfaces.
A reinstall of net-tools and ifupdown seems to have brought networking.service back up: apt-get install --reinstall net-tools ifupdown
Looking at the systemctl status output, I bet everything was fine until dhclient/ifup tried to renew about 24 hours after the initial connection (boot), found that networking.service was down, and couldn't renew, killing the network connection.
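One thing worth double-checking in that theory is the timing. The dhclient line in the log says "renewal in 37359 seconds", and a quick conversion (secs_to_hms is just a throwaway helper name) puts that at a bit over 10 hours rather than 24. As a guess, the 24-hour drop could instead line up with the lease actually expiring after the renewal failed, but the lease length itself isn't shown in the log, so treat that as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: convert a number of seconds to H:MM:SS.
secs_to_hms() {
    s=$1
    printf '%d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

secs_to_hms 37359   # renewal timer from the dhclient log line; prints 10:22:39
```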
We'll see if it's actually fixed in 24 hours or so, but hopefully this little endeavour can help someone else plagued with this issue in the future. I'm still not sure exactly what caused it. I'll confirm tomorrow...
Update 3 - Looks like that was the culprit. Container is still connected 24+ hrs since reboot, networking.service is still active, and dhclient was able to renew.
Update 4 - All was well and good until I started playing with setting up Traefik. Not sure if this brought it to the surface or if it just happened coincidentally, but networking.service failed again. Tried restarting the service, but it failed. Took a look in /etc/network/interfaces and found there was an entry for iface eth0 inet6 dhcp, and I don't use IPv6. Removed that line and networking.service restarted successfully. Perhaps that was the issue the whole time; it would fit the "Could not get a link-local address" error in the status output above.
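If you want to check your own container for the same thing, here is a minimal sketch that lists any inet6 dhcp stanzas in an ifupdown-style interfaces file. The function name find_inet6_dhcp is hypothetical, and the default path assumes the usual Debian location, /etc/network/interfaces.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) any "iface <name> inet6 dhcp"
# stanza headers in an ifupdown-style interfaces file.
# Defaults to /etc/network/interfaces, the usual location on Debian.
find_inet6_dhcp() {
    file=${1:-/etc/network/interfaces}
    grep -nE '^[[:space:]]*iface[[:space:]]+[^[:space:]]+[[:space:]]+inet6[[:space:]]+dhcp' "$file"
}

# Example use (not run automatically here):
#   find_inet6_dhcp                      # check the default file
#   find_inet6_dhcp /path/to/interfaces  # or a copy of it
```

No output (and a non-zero exit status) means no such stanza was found. This only finds the line; whether removing it is right depends on whether you actually use IPv6.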
LXC/LXD can be highly available (HA) and stable, and LXD can also run real VMs, which provide kernel isolation as well: https://ubuntu.com/blog/lxd-virtual-machines-an-overview
In Proxmox?
There are some quirks with Docker in LXC. Nothing that can’t be overcome, but Docker in a VM is definitely more stable.