Once upon a time,
- pfSense would host and handle certificates for the few, explicit applications I had running outside of Docker, and
- pfSense would transparently pass any implicit traffic down to my Docker hosts, where I managed certificates via an automated process.
But then, the Fire Nation pfSense 2.4.2 upgrade attacked!
My working setup was broken: I could no longer create any new Let's Encrypt (LE) certificates, let alone renew existing ones. They all reported similar issues:
LE standalone mode
In pfSense, you can specify how the ACME package proves ownership of specific domains. The options range from modifying DNS TXT records to running an HTTP server or creating/deleting files in a specific directory.
Unfortunately for me, my DNS provider doesn't have a public API, and since I want an automated solution, the DNS method is out. I could have spun up a whatever-service to run an HTTP server, but that would introduce another component that I could very well break. 😅
So, I decided to give the standalone mode a try.
Standalone mode dynamically creates an HTTP server that exists only while the `acme.sh` script runs. Using `socat`, it can respond to whatever token challenge the LE server(s) send. The `acme.sh` script should only run when:
- creating a new domain not living inside my Docker deployment (rare), or
- renewing an existing, non-Docker service's certificate (every ~90 days)
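To make the standalone idea concrete, here is a minimal sketch of a short-lived challenge responder built on Python's standard library. It is not what `acme.sh` runs (that uses `socat`), and the token and key authorization values are hypothetical placeholders. The `/.well-known/acme-challenge/<token>` path is the one the ACME HTTP-01 challenge defines in RFC 8555.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical challenge values: a real ACME client gets the token from the CA
# and derives the key authorization from its account key.
TOKEN = "abc"
KEY_AUTHORIZATION = "abc.hypothetical-account-thumbprint"


class ChallengeHandler(BaseHTTPRequestHandler):
    """Answer only the HTTP-01 challenge path; everything else gets a 404."""

    def do_GET(self):
        if self.path == f"/.well-known/acme-challenge/{TOKEN}":
            body = KEY_AUTHORIZATION.encode("ascii")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def start_standalone_responder(host="127.0.0.1", port=0):
    """Serve challenges in a background thread (port 0 = any free port).

    Binding 127.0.0.1 makes this an IPv4-only listener. Call shutdown() and
    server_close() when validation is done, so the server exists only for
    the duration of the run, like standalone mode.
    """
    server = HTTPServer((host, port), ChallengeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port, path):
    """GET a path from the local responder and return (status, body)."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_standalone_responder()
    port = server.server_address[1]
    print(fetch(port, f"/.well-known/acme-challenge/{TOKEN}"))
    server.shutdown()
    server.server_close()
```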
According to the documentation, in standalone mode, the interactions are:
| Step | acme.sh | Let's Encrypt |
|---|---|---|
| 1 | Hi LE, I own blog.lolnope.us | Sign this nonce (0xacce55), and place data (123) at resource (blog.lolnope.us/abc) |
| 2 | Data (123) in place (blog.lolnope.us/abc), and the signed nonce (0x4B1D) is ready | |
| 3 | | I see the signed nonce (0x4B1D), and it is signed by your private key |
| 4 | | I see data (123) at resource (blog.lolnope.us/abc) |
At this point, if there were no errors, LE knows that your public key is authorized for blog.lolnope.us and will allow you to create/revoke certificates for that domain.
Two things to note in the above example:
- Interaction 1 is a single HTTP request/response exchange
- LE reaches in twice during this exchange; once for the nonce (3), and a second time for the resource validation (4)
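For reference, in the ACME HTTP-01 challenge as specified in RFC 8555, the data the client places at the resource (the "data (123)" in the table) is a key authorization: the token, a period, and the base64url-encoded SHA-256 JWK thumbprint (RFC 7638) of the account's public key. The sketch below computes it; the EC key coordinates are made-up placeholders, not a real key, and only EC and RSA keys are handled.

```python
import base64
import hashlib
import json

# Required JWK members per key type (RFC 7638, section 3.2).
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
}


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout ACME/JOSE."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint over the required members only, in sorted order."""
    members = {name: jwk[name] for name in REQUIRED_MEMBERS[jwk["kty"]]}
    canonical = json.dumps(members, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """token || '.' || base64url(thumbprint): the body served for HTTP-01."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"


if __name__ == "__main__":
    # Hypothetical account key; real x/y values are base64url-encoded coordinates.
    jwk = {"kty": "EC", "crv": "P-256", "x": "hypothetical-x", "y": "hypothetical-y"}
    print(key_authorization("abc", jwk))
```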
One leg of the journey
Since the error message from LE was something about a "timeout", it meant either:
- `acme.sh` isn't requesting the proper resource in time, or
- LE cannot reach the proper resource from my infrastructure
Option 1 was tentatively ruled out first, since I knew it had worked in the past and, more importantly, it was reaching LE to start the process in the first place. Still, I decided to check whether the pfSense upgrade had messed up the `acme.sh` script. I went as far as to `curl` the source `.git` checkout to my pfSense box and try to run the latest version of `acme.sh`. That lead ended abruptly when I hit failures caused by `socat` errors. The latest version must be slightly different from what pfSense's repo has packaged.
A version check of the `acme.sh` script would have confirmed this, too...
Option 2 seemed more likely, but it also didn't immediately make sense, since the `acme.sh` logs indicated that LE was able to verify my nonce (step 3 above). I ran a packet capture on the `acme.sh` port and saw traffic flowing: there was a request from LE for the nonce, but then nothing for the data provisioned at the resource.
At this point, I thought the pfSense ACME package might be broken and posted about it in the pfSense forums. It was confirmed that the package was fine, which only made me rage further. 👺
So how can LE reach in successfully at first, and then time out afterwards??
HAProxy and its configuration file
I decided to inspect HAProxy's configuration file, which pfSense generates from its relatively decent GUI. When I opened the file and inspected the backend for this ACME verification step, everything seemed fine:
```
backend 0_HTTP_ACME_Standalone_http_ipvANY
	mode			http
	log			global
	timeout connect		30000
	timeout server		30000
	retries			3
	server			pfsense_0 127.0.0.1:8082
	server			pfsense_1 ::1:8082
```
This matched what I was expecting... I had a properly working NAT rule that passed HTTP (port 80) traffic to `localhost` on port 8082. Everything was in place. Why wasn't this working? I even had a backend for IPv6!
Wait, why was there a backend server for IPv6 when I had only specified `localhost` as the backend in the pfSense HAProxy GUI? I understand `::1` is the IPv6 loopback address, but I had thought `localhost` would only generate `127.0.0.1`.
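One plausible explanation (an assumption; the pfSense HAProxy package source isn't examined here) is that the hostname `localhost` resolves to more than one loopback address. This sketch lists every address a name resolves to for TCP via `socket.getaddrinfo`. On many systems `localhost` yields both `127.0.0.1` and `::1`, but the exact result depends on the local resolver configuration (e.g. `/etc/hosts`).

```python
import ipaddress
import socket


def resolve_tcp_addresses(host: str, port: int) -> list[str]:
    """Return the distinct IP addresses `host` resolves to for TCP, in resolver order."""
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        address = sockaddr[0]
        if address not in addresses:
            addresses.append(address)
    return addresses


if __name__ == "__main__":
    # Each address here would plausibly become its own backend server.
    for address in resolve_tcp_addresses("localhost", 8082):
        version = ipaddress.ip_address(address).version
        print(f"IPv{version}: {address}")
```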
Hmm. This started me thinking.
In standalone mode, `acme.sh` uses `socat` to create a LISTEN'ing socket, but does it do so for both the IPv4 and IPv6 stacks? Nope:
```
[2.4.2-RELEASE][…]/tmp/acme: sockstat -l46 | grep 8082
root     socat      96563 5  tcp4   *:8082                *:*
```
I ran this `sockstat` command during certificate creation/renewal, and it only ever listens on IPv4 (`tcp4`). So `acme.sh` only creates an IPv4 socket, but the HAProxy backend has two servers configured, one for each IP stack. But I had also specifically turned off load balancing for this backend.
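The `tcp4` socket in the `sockstat` output means nothing answers on `::1`. A rough reproduction with Python sockets, using an IPv4 listener on an OS-chosen port as a stand-in for `socat` on 8082: a connection to `127.0.0.1` succeeds while one to `::1` fails. Locally the IPv6 attempt is typically refused outright (or fails immediately if IPv6 is disabled); in the setup above, HAProxy sat in between and LE saw a timeout instead.

```python
import socket


def open_ipv4_listener() -> socket.socket:
    """Listen on IPv4 only (like the tcp4 *:8082 socket sockstat reported)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", 0))  # port 0: let the OS pick a free port
    listener.listen()
    return listener


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try a TCP connection; any OS-level failure counts as unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or no IPv6 support at all
        return False


if __name__ == "__main__":
    listener = open_ipv4_listener()
    port = listener.getsockname()[1]
    print("127.0.0.1 reachable:", can_connect("127.0.0.1", port))
    print("::1 reachable:", can_connect("::1", port))
    listener.close()
```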
I wondered what HAProxy's default behavior is when a backend has multiple servers but no balancing algorithm specified. Let's consult the documentation:
The load balancing algorithm of a backend is set to roundrobin when no other algorithm, mode nor option have been set. The algorithm may only be set once for each backend.
This knowledge started to make things fall into place.
HAProxy was properly accepting the first LE request because it forwarded that request to the IPv4 backend server. But when the second LE request came in, HAProxy forwarded it to the IPv6 backend server, where no socket was LISTEN'ing, so the request would hang and eventually time out.
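The failure sequence can be modeled with a toy round-robin balancer. This is a simplified sketch, not HAProxy's implementation: it assumes the rotation starts at the first server, the server names come from the generated config above, and whether a server answers is hard-coded to match the `sockstat` output. With both servers in the pool, the nonce request lands on `pfsense_0` and the resource request on `pfsense_1`; with only `127.0.0.1` in the pool, both land on the working server.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Server:
    name: str
    address: str
    listening: bool  # hard-coded: is a socket LISTEN'ing at this address?


def route(requests, servers):
    """Assign each request to the next server in round-robin order.

    Returns (request, server name, outcome) tuples. A server with no
    listener is reported as "timeout", which is how LE experienced it.
    """
    rotation = cycle(servers)
    results = []
    for request in requests:
        server = next(rotation)
        outcome = "ok" if server.listening else "timeout"
        results.append((request, server.name, outcome))
    return results


# Names/addresses mirror the generated backend; liveness mirrors sockstat.
IPV4 = Server("pfsense_0", "127.0.0.1:8082", listening=True)
IPV6 = Server("pfsense_1", "::1:8082", listening=False)
LE_REQUESTS = ["nonce (step 3)", "resource (step 4)"]

if __name__ == "__main__":
    # "localhost" backend: both loopback servers in the rotation.
    for row in route(LE_REQUESTS, [IPV4, IPV6]):
        print("localhost:", row)
    # The fix: only 127.0.0.1 in the rotation.
    for row in route(LE_REQUESTS, [IPV4]):
        print("127.0.0.1:", row)
```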
All of this because either pfSense or its HAProxy package decided it was a good idea to turn `localhost` into two backend servers, one per IP stack.
The easy and obvious solution is to explicitly create an IPv4 backend server. In the pfSense HAProxy GUI, I removed `localhost` and listed `127.0.0.1` as the only backend server. As soon as I did this, the ACME protocol was unblocked and I was able to create/renew certificates again.
An arguably better solution would be to configure the backend with multiple retries across all of its servers, each with a short timeout. HAProxy would then be responsible for re-issuing requests to the backend servers under whichever load balancing algorithm is chosen, and LE would never see those extra backend attempts. I'll have to figure out the proper config to accomplish this.
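The retry idea can be sketched as a hypothetical balancer: on a failed attempt it moves on to the next server instead of surfacing the failure to the client. This models the behavior described above and is not HAProxy configuration; HAProxy's own settings in this area (for example `retries` and `timeout connect`, already in the generated config, plus `option redispatch`) are described in its configuration manual. Whether a server answers is hard-coded here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    listening: bool  # hard-coded: is a socket LISTEN'ing at this server?


class RetryingRoundRobin:
    """Hypothetical balancer: round-robin that retries the next server on failure."""

    def __init__(self, servers, retries=3):
        self.servers = list(servers)
        self.retries = retries
        self.next_index = 0

    def forward(self, request):
        """Return (final outcome, names of servers attempted) for one request.

        `request` is only a label here; the toy model never inspects it.
        """
        attempts = []
        for _ in range(1 + self.retries):
            server = self.servers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.servers)
            attempts.append(server.name)
            if server.listening:  # stands in for a fast, successful connect
                return "ok", attempts
        return "timeout", attempts


if __name__ == "__main__":
    balancer = RetryingRoundRobin([Server("pfsense_0", True), Server("pfsense_1", False)])
    for request in ["nonce (step 3)", "resource (step 4)"]:
        print(request, balancer.forward(request))
```

The second request first hits the dead IPv6 server, then is retried on the IPv4 one, so the client only ever sees a success.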
This was quite a dive!