HAProxy and localhost

Once upon a time

I had a working pfSense, HAProxy, and Let's Encrypt (LE) setup:

pfSense would host and handle certificates for the few, explicit applications I had running outside of Docker, and
pfSense would transparently pass any implicit traffic down to my Docker hosts where I managed certificates via an automated process

But then, the ~~Fire nation~~ pfSense 2.4.2 upgrade attacked!

My working setup was broken and I could no longer create any new LE certificates, let alone renew existing ones. They all reported similar issues:

blog.lolnope.us:Verify error:Fetching http://blog.lolnope.us/.well-known/acme-challenge/${TOKEN}: Timeout

LE standalone mode

In pfSense, you can specify how the ACME package should prove ownership of specific domains. The options range from modifying DNS TXT records to running an HTTP server and creating/deleting files in a specific directory.

Unfortunately for me, my DNS provider doesn't have a public API, so the DNS method is out since I want an automated solution. I could have spun up a whatever-service to run an HTTP server, but that introduces another component that I could very well break. 😅
So, I decided to give the standalone mode a try.

Standalone dynamically creates an HTTP server that only exists during the execution of the acme.sh script. Using nc or socat, it can successfully respond to any token challenge the LE server(s) decides to send. This acme.sh script should only run when:

creating a new domain not living inside my Docker deployment (rare), or
renewing an existing, non-Docker service's certificate (every ~90 days)
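
To make the "server that only exists while the script runs" idea concrete, here is a minimal Python sketch of a throwaway HTTP responder. It is not how acme.sh does it (that uses nc or socat); the function names, the loopback bind, and the token/body values are all made up for illustration.

```python
import http.server
import threading
import urllib.request


def start_challenge_server(token, body, port=0):
    """Start a short-lived HTTP server that answers a single challenge path.

    Hypothetical helper: acme.sh itself uses nc/socat, not Python.
    """
    challenge_path = "/.well-known/acme-challenge/" + token
    payload = body.encode()

    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == challenge_path:
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(payload)))
                self.end_headers()
                self.wfile.write(payload)
            else:
                self.send_error(404)

        def log_message(self, format, *args):
            pass  # silence per-request logging in this sketch

    # Port 0 asks the OS for any free port.
    server = http.server.HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_challenge(port, token):
    """Fetch the challenge path the way a validating client would."""
    url = "http://127.0.0.1:%d/.well-known/acme-challenge/%s" % (port, token)
    # Ignore any proxy settings from the environment; this is a loopback call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode()


def stop_challenge_server(server):
    """Tear the server down again once validation is over."""
    server.shutdown()
    server.server_close()
```

The point of the sketch is the lifecycle: start, answer the one path, stop. Once `stop_challenge_server` has run, nothing is listening on that port any more.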

According to the documentation, in standalone mode, the interactions are:

| # | Client (`acme.sh`) | LE Certificate Authority |
|---|---|---|
| 1 | Hi LE, I own `blog.lolnope.us` | Sign this nonce (`0xacce55`), and place data (`123`) at resource (`blog.lolnope.us/abc`) |
| 2 | Data (`123`) in place (`blog.lolnope.us/abc`), and the signed nonce (`0x4B1D`) is ready | |
| 3 | | I see the signed nonce (`0x4B1D`), and it is signed by your private key |
| 4 | | I see data (`123`) at resource (`blog.lolnope.us/abc`) |

At this point, if there were no errors, LE would know that your public key is authorized for blog.lolnope.us, and will allow you to create/revoke certificates for that domain.

Two things to note in the above example:

Interaction 1 is a single HTTP GET session
LE reaches in twice during this exchange: once for the nonce (3), and a second time for the resource validation (4)

One leg of the journey

Since the error message from LE mentioned a "timeout", one of two things had to be true:

acme.sh isn't requesting the proper resource in time, or
LE cannot reach the proper resource from my infrastructure

Option 1 was tentatively ruled out first, since I knew the setup had worked in the past and, more to the point, acme.sh was reaching LE to start the process in the first place. Still, I decided to check whether the pfSense upgrade had messed up the acme.sh script. I went as far as to curl the upstream source onto my pfSense box and try to run the latest version of acme.sh. That lead ended abruptly when I hit nc/socat errors. The latest version must be slightly different from what pfSense's repo has packaged.

A version check of the acme.sh script would have confirmed this, too...

Option 2 seemed more likely, but it didn't immediately make sense either, since the acme.sh logs indicated that LE was able to verify my nonce (step 3 above). I did a packet capture on the acme.sh port and saw traffic flowing: there was a request from LE for the nonce, but then nothing for the data provisioned at the resource.

At this point I thought the ACME pfSense package might be broken and posted about it in the pfSense forums. It was confirmed that the package was fine. Which caused me to further rage. 👺

So how can LE reach in successfully at first, and then time out afterwards??

HAProxy and its configuration file

I decided to inspect the configuration file of HAProxy, since pfSense has a relatively decent GUI which generates said config. When I opened the file and inspected the backend for this ACME verification step, everything seemed fine:

backend 0_HTTP_ACME_Standalone_http_ipvANY  
    mode            http
    log         global
    timeout connect     30000
    timeout server      30000
    retries         3
    server          pfsense_0 127.0.0.1:8082  
    server          pfsense_1 ::1:8082

This matched what I was expecting... I had a properly working NAT rule which passed HTTP 80 traffic to localhost on port 8082. Everything was in place. Why wasn't this working? I even had a backend for IPv6!

...

Wait, why is there a backend for IPv6 when in the HAProxy pfSense GUI I had only specified "localhost" as the backend? I understand ::1 is the IPv6 loopback address, but I had thought "localhost" would have only generated 127.0.0.1.

Hmm. This started me thinking. acme.sh uses nc/socat to create a LISTEN'ing socket, but does it do it for both IPv4 and IPv6 stacks? Nope:

[2.4.2-RELEASE][root@pfsense]/tmp/acme: sockstat -l46 | grep 8082
root     socat      96563 5  tcp4   *:8082                *:*

I ran this sockstat command while a certificate creation/renewal was in progress, and socat only ever showed up as a tcp4 listener (`*:8082`). Nothing was listening on tcp6, so ::1:8082 went unanswered.
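
The same check can be reproduced without pfSense. The sketch below is a stand-in, not the acme.sh code: it opens an IPv4-only listener (like the tcp4 socat socket in the sockstat output) on an OS-assigned port and then tries both loopback addresses. The helper names are mine.

```python
import socket


def ipv4_only_listener():
    """Open an IPv4-only listening socket on an ephemeral port (stand-in for socat)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("0.0.0.0", 0))  # wildcard IPv4 address, any free port
    sock.listen(5)
    return sock


def accepts_connection(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and hosts without IPv6 at all.
        return False


if __name__ == "__main__":
    listener = ipv4_only_listener()
    port = listener.getsockname()[1]
    for host in ("127.0.0.1", "::1"):
        print(host, port, accepts_connection(host, port))
    listener.close()
```

With only an `AF_INET` socket open, the IPv4 loopback connects and the IPv6 loopback does not, which is exactly the asymmetry the HAProxy backend was tripping over.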

"Helpful" defaults

So acme.sh only creates an IPv4 socket, but the HAProxy backend has two servers configured, one for each IP stack. But I had also specifically turned off load balancing for this backend.

I wonder what the default behavior of HAProxy is when there are multiple backend servers, but no balancing algorithm specified. Let's consult the documentation:

The load balancing algorithm of a backend is set to roundrobin when no other algorithm, mode nor option have been set. The algorithm may only be set once for each backend.

This knowledge started to make things fall into place.

HAProxy was properly accepting the first LE request because it was forwarding the request to the IPv4 backend server. But when the second LE request came in, it attempted to forward to the IPv6 backend server, there was no socket LISTEN'ing, and the request would hang, eventually timing out.
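
A toy model makes the alternating pattern obvious. This is not HAProxy's real scheduler (which also deals with weights, health checks and so on), just a plain rotation over the two servers from the generated config, with a made-up `listening` set standing in for what socat actually opened.

```python
from itertools import cycle


def simulate_roundrobin(servers, listening, request_count):
    """Rotate through servers in order; a request succeeds only if its server listens.

    Returns a list of (request_number, server, ok) tuples.
    """
    rotation = cycle(servers)
    results = []
    for request_number in range(1, request_count + 1):
        server = next(rotation)
        results.append((request_number, server, server in listening))
    return results


if __name__ == "__main__":
    servers = ["127.0.0.1:8082", "[::1]:8082"]
    listening = {"127.0.0.1:8082"}  # socat only opened the IPv4 socket
    for number, server, ok in simulate_roundrobin(servers, listening, 2):
        print(number, server, "answered" if ok else "no listener")
```

Request 1 lands on the IPv4 server and is answered; request 2 lands on the IPv6 server, where nobody is home.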

All of this because either pfSense or HAProxy's package decided it was a good idea to turn "localhost" into 127.0.0.1 and ::1.

Solution(s)

The easy and obvious solution here is to explicitly create an IPv4 backend. In the pfSense HAProxy GUI, I removed "localhost" and specifically listed the only backend server as "127.0.0.1". As soon as I did this, the ACME protocol was unblocked and I was able to create/renew certificates again.

An arguably better solution would be to give the backend several retries, spread across all of its servers, with connection timeouts short enough to fail quickly. That would make HAProxy responsible for re-issuing a request to another backend server, under whichever load balancing algorithm is chosen, and LE would never see those extra attempts. I'll have to figure out the proper config to accomplish this.
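
I haven't worked out the real config yet, but as far as I can tell the relevant HAProxy knobs are `retries` and `option redispatch`. The toy model below (Python, not HAProxy; every name in it is hypothetical) shows the effect I'm after: a failed attempt is quietly retried on the next server, so the caller only ever sees the final result. It moves to the next server on every failed attempt, which is a simplification and may not match exactly when HAProxy redispatches.

```python
from itertools import cycle


def forward_with_redispatch(servers, listening, request_count, retries=3):
    """Toy model: on a failed attempt, retry the request on the next server.

    Returns a list of (request_number, servers_tried, ok) tuples.
    """
    rotation = cycle(servers)
    outcomes = []
    for request_number in range(1, request_count + 1):
        tried = []
        ok = False
        # One initial attempt plus up to `retries` further attempts.
        for _ in range(retries + 1):
            server = next(rotation)
            tried.append(server)
            if server in listening:
                ok = True
                break
        outcomes.append((request_number, tried, ok))
    return outcomes


if __name__ == "__main__":
    servers = ["127.0.0.1:8082", "[::1]:8082"]
    listening = {"127.0.0.1:8082"}  # only the IPv4 socket exists
    for number, tried, ok in forward_with_redispatch(servers, listening, 3):
        print(number, " -> ".join(tried), "ok" if ok else "failed")
```

In this model every request ends up answered by the IPv4 server, even when the rotation first picks the dead IPv6 one; the extra hop stays inside the proxy.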

This was quite a dive!