Tag: ITstories

  • Sensitive LXC container networking, or how to break and fix GitLab CI without meaning to

    A few days ago, one of my clients ran into an issue: the CI pipeline for the develop branch suddenly stopped working. Strangely enough, pipelines for other branches — feature, hotfix — were running just fine. The situation seemed odd.

    The project setup is such that each branch (develop, feature, hotfix) is deployed on a separate dedicated server inside LXC containers. (master and release are handled differently.) Everything is self-contained. And yet, only the develop branch pipeline stopped triggering.

    The first thing I did was check the GitLab pipeline status. Jobs were either marked as “stuck” or timed out. The server, containers, and host machine were all functioning normally. So I checked the GitLab Runner inside the container, and it turned out it couldn’t reach GitLab. That gave me a clue.

    I started thinking: what could’ve changed? There had been no major updates or migrations in the last month. But then I remembered — not long ago, I added another virtual bridge (network interface) for a separate container. Everything worked fine at the time, so I forgot about it. But now this new issue popped up with the develop container. Could it be related?

    Okay, time to check network connectivity inside the develop container:
    ping 8.8.8.8 — nothing. No internet.
    Checked the host — all good.
    Could it be the firewall? Nope.
    Pinged another container — it responded fine.
    The ping went to a bare IP address, so DNS is out of the picture. That makes it 99% likely that the issue is with routing or NAT for the container’s traffic. Let’s see what’s going on with NAT on the host:

    > iptables -t nat -L -n -v

    Nothing. No MASQUERADE rule. Bingo!
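
    The same check can be scripted so it is easy to repeat after the next network change. This is only a sketch: the function name has_masquerade and the subnet 10.0.3.0/24 are made up for illustration, and it assumes the NAT rules are managed through iptables, as in the listing above.

```shell
#!/bin/sh
# has_masquerade SUBNET
# Reads the output of `iptables -t nat -S POSTROUTING` on stdin and succeeds
# (exit status 0) only if some rule with source SUBNET jumps to MASQUERADE.
# SUBNET is the bridge subnet in CIDR form; 10.0.3.0/24 below is a placeholder.
has_masquerade() {
    subnet="$1"
    # -F treats the pattern as a fixed string; -e is required because the
    # pattern itself starts with a dash.
    grep -F -e "-s $subnet " | grep -q -e "-j MASQUERADE"
}

# Usage on the host (needs root):
#   iptables -t nat -S POSTROUTING | has_masquerade 10.0.3.0/24 \
#       || echo "no MASQUERADE rule for 10.0.3.0/24"
```

    Feeding the rules in on stdin keeps the function free of side effects, so it can be tried against saved output without root access.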

    Adding the new bridge must have cleared out the NAT rules. As a result, traffic from the develop container wasn’t being masqueraded and couldn’t reach the internet. Without internet, GitLab Runner couldn’t talk to the Git server, so the pipeline jobs never ran.

    I tried to bring the network back by restarting it inside the container:

    > lxc exec <container-name> -- systemctl restart systemd-networkd

    That didn’t help.
    Restarting the container: nothing.
    Tried a few other tricks, still no luck.

    I even considered creating a new bridge, but before going that far, I decided to try the simplest thing: just reattach the bridge to the container.

    > lxc network detach <bridge-name> <container-name> <interface-name>
    > lxc network attach <bridge-name> <container-name> <interface-name>

    It worked. The container regained internet access.
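
    For next time, the detach/attach pair and the follow-up ping can be wrapped in a small helper. Treat it as a sketch rather than a tested tool: reattach, run and DRY_RUN are names invented here, and lxdbr0, develop and eth0 stand in for the real bridge, container and device names.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reattach NETWORK INSTANCE DEVICE
# Detaches the device from the bridge, attaches it again, then pings
# 8.8.8.8 from inside the container to confirm outbound connectivity.
# Stops at the first step that fails.
reattach() {
    network="$1"
    instance="$2"
    device="$3"
    run lxc network detach "$network" "$instance" "$device" &&
    run lxc network attach "$network" "$instance" "$device" &&
    run lxc exec "$instance" -- ping -c 1 -W 2 8.8.8.8
}

# Preview the commands without touching anything:
#   DRY_RUN=1; reattach lxdbr0 develop eth0
```

    The dry-run switch is there because detaching the interface drops the container’s network for a moment, so it is worth reading the exact commands before running them for real.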
    I ran a runner verification:

    > gitlab-runner verify

    GitLab Runner happily reported a successful connection, and the pipelines started running again.

    The whole thing took about an hour. The hardest part wasn’t technical: it was figuring out where to start, what to check, and how to connect the dots. In the end, the fix was almost laughably simple: “reconnect.”

    It reminded me of an old rule I learned back when I worked in tech support for a telecom company. When a customer called in saying “Nothing works, everything’s broken,” we had a golden first response: “Did you try turning it off and on again?”
    And honestly? That solved the issue in about 40% of cases. Just like this time.