Table of Contents
- What we are replacing
- The target topology
- Internet NEGs and SNI
- URL map header-based routing
- Terraform translation
- Differences vs the nginx version
- Cutover playbook
- Adding, removing, or changing routes
- Debugging
- What it costs
What we are replacing
The starting point is a tiny nginx pod running on Azure Kubernetes Service. Its job is to terminate TLS for a single public hostname and forward each request to one of several upstream HTTPS hosts, chosen by the value of a custom header. A condensed version of the configmap:
map $http_x_tenant_id $backend {
"T1" tenant1-admin.dev.example.com;
"T2" tenant2-admin.dev.example.com;
"T3" admin.partner.com;
"T4" tenant4-admin.prod.example.com;
default tenant1-admin.dev.example.com;
}
server {
listen 80;
location / {
proxy_set_header Host $backend;
proxy_pass https://$backend;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_ssl_server_name on;
proxy_ssl_name $backend;
}
location /healthcheck {
return 200 "OK";
}
}
An nginx-ingress in front terminates TLS for proxy.example.com. The pod itself just listens on plain HTTP; the ingress controller rewrites Host headers as it goes. Useful properties that are easy to take for granted:
- TLS termination and re-encryption to the upstream.
- SNI to the upstream set to the upstream hostname (proxy_ssl_server_name on).
- Host header rewriting so the upstream sees a Host it recognises.
- A fixed 200 OK response on /healthcheck for liveness/uptime monitoring.
- An implicit default route that catches missing or unknown header values.
All of these have GCP-native equivalents on the global external Application Load Balancer, except one: /healthcheck, which we will need to either drop or replace.
The target topology
client -> proxy.example.com -> GCP LB (TLS terminated)
-> URL map (matcher-proxy)
-> headerMatches: X-Tenant-ID == "T1" -> backend_service[tenant1-admin.dev.example.com]
-> headerMatches: X-Tenant-ID == "T2" -> backend_service[tenant2-admin.dev.example.com]
-> headerMatches: X-Tenant-ID == "T3" -> backend_service[admin.partner.com]
-> headerMatches: X-Tenant-ID == "T4" -> backend_service[tenant4-admin.prod.example.com]
-> default_service -> backend_service[tenant1-admin.dev.example.com]
-> Internet NEG (INTERNET_FQDN_PORT, port 443)
-> upstream over public HTTPS
Three GCP primitives do the work:
- Internet NEG of type INTERNET_FQDN_PORT: represents one upstream FQDN on a fixed port. GCP resolves the FQDN at request time and uses it as the SNI server name for the outbound TLS handshake.
- Backend service with protocol = HTTPS: wraps an Internet NEG. It is the unit the URL map points at.
- URL map with route_rules: the match_rules.header_matches block provides header-based dispatch. Unmatched requests fall to path_matcher.default_service.
A managed cert plus a certificate-map entry handles TLS termination at the LB.
Internet NEGs and SNI
INTERNET_FQDN_PORT is the NEG type that lets a global Application LB reach an arbitrary host on the public internet by hostname (its sibling, INTERNET_IP_PORT, takes a fixed IP address instead). The endpoint is just an (fqdn, port) tuple. At request time GCP resolves the fqdn via public DNS, opens an HTTPS connection to one of the addresses, and sets the TLS SNI server name to the fqdn.
This replaces three nginx directives at once:
| nginx | GCP behavior |
|---|---|
| proxy_pass https://$backend | Backend service of type HTTPS dialing the Internet NEG |
| proxy_ssl_server_name on | SNI is on by default for HTTPS Internet NEGs |
| proxy_ssl_name $backend | SNI value is the NEG endpoint's fqdn |
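The Host header is the exception. Google's Internet NEG documentation states that, unless a custom request header overrides it, the backend service preserves the Host header the client sent to the load balancer, which here is proxy.example.com rather than the upstream fqdn. To reproduce proxy_set_header Host $backend, set custom_request_headers = ["Host: <fqdn>"] (customRequestHeaders in the API) on each backend service, then confirm with a test request that the upstream sees the Host it expects.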
URL map header-based routing
google_compute_url_map supports advanced route_rules whose match_rules.header_matches block compares a named header against an exact value (or a regex, prefix, suffix, presence, or range). Each rule gets a unique priority within its path_matcher; lower numbers win.
path_matcher {
name = "matcher-proxy"
default_service = google_compute_backend_service.proxy_backend[var.default_fqdn].id
# GCP requires route_rules to be SUBMITTED in priority-ascending order.
# Terraform iterates maps alphabetically by key, so key by zero-padded
# priority to force the iteration order to match the priority order.
dynamic "route_rules" {
for_each = { for r in var.routes : format("%05d", r.priority) => r }
content {
priority = route_rules.value.priority
match_rules {
prefix_match = "/"
header_matches {
header_name = "X-Tenant-ID"
exact_match = route_rules.value.id
}
}
service = google_compute_backend_service.proxy_backend[route_rules.value.fqdn].id
}
}
}
Three subtleties:
- match_rules requires at least one path-matching field. prefix_match = "/" matches everything, leaving the routing decision purely on the header.
- Header name matching is case-insensitive (HTTP standard). Header value matching with exact_match is case-sensitive, so match exactly the string the client sends.
- Iteration order matters. GCP rejects the API call with "Invalid value for field 'resource.pathMatchers[N].routeRules[M].priority': 'X'. Within a pathMatcher, a route rule must have a priority that's higher than the previous route rule's priority" if the rules arrive in non-ascending priority order. Terraform iterates maps alphabetically by key, so naively keying by the route id (e.g. "FA", "KG", "T1") produces a submission order that has nothing to do with priority. Keying by zero-padded priority (format("%05d", r.priority) → "00100", "00110", …) forces alphabetical iteration to match priority order. The actual priority attribute on each rule is still set from route_rules.value.priority; only the iteration order changes.
When no route_rules match (header missing, header value not in the table), the request falls through to default_service. This is the GCP equivalent of nginx's default map key.
Terraform translation
A complete, working example. var.routes and var.default_fqdn capture the nginx map; everything else is derived.
variable "routes" {
type = list(object({
id = string
fqdn = string
priority = number
}))
default = [
{ id = "T1", fqdn = "tenant1-admin.dev.example.com", priority = 100 },
{ id = "T2", fqdn = "tenant2-admin.dev.example.com", priority = 110 },
{ id = "T3", fqdn = "admin.partner.com", priority = 120 },
{ id = "T4", fqdn = "tenant4-admin.prod.example.com", priority = 130 },
]
}
variable "default_fqdn" {
type = string
default = "tenant1-admin.dev.example.com"
}
locals {
# Unique FQDNs across the routing table — one NEG / backend per fqdn.
unique_fqdns = toset(concat([for r in var.routes : r.fqdn], [var.default_fqdn]))
}
# Internet NEG per unique upstream FQDN.
resource "google_compute_global_network_endpoint_group" "proxy_neg" {
for_each = local.unique_fqdns
name = "proxy-neg-${replace(each.value, ".", "-")}"
network_endpoint_type = "INTERNET_FQDN_PORT"
default_port = 443
}
resource "google_compute_global_network_endpoint" "proxy_endpoint" {
for_each = local.unique_fqdns
global_network_endpoint_group = google_compute_global_network_endpoint_group.proxy_neg[each.value].name
fqdn = each.value
port = 443
}
# Backend service per unique upstream FQDN.
resource "google_compute_backend_service" "proxy_backend" {
for_each = local.unique_fqdns
name = "proxy-${replace(each.value, ".", "-")}"
protocol = "HTTPS"
load_balancing_scheme = "EXTERNAL_MANAGED"
# Send each upstream its own hostname as Host. Without an override the LB
# passes the client's Host (proxy.example.com) through to the upstream.
custom_request_headers = ["Host: ${each.value}"]
backend {
group = google_compute_global_network_endpoint_group.proxy_neg[each.value].id
}
depends_on = [google_compute_global_network_endpoint.proxy_endpoint]
}
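# Managed cert + cert-map entry for the public hostname.
# Assumes google_certificate_manager_certificate_map.main is defined alongside
# the rest of the load balancer; it is not created in this example.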
resource "google_certificate_manager_certificate" "proxy_cert" {
name = "cert-proxy"
managed {
domains = ["proxy.example.com"]
}
}
resource "google_certificate_manager_certificate_map_entry" "proxy_entry" {
name = "entry-proxy"
map = google_certificate_manager_certificate_map.main.name
certificates = [google_certificate_manager_certificate.proxy_cert.id]
hostname = "proxy.example.com"
}
# URL map: one host_rule + one path_matcher; route_rules generated from var.routes.
resource "google_compute_url_map" "main" {
name = "main-url-map"
default_service = google_compute_backend_service.proxy_backend[var.default_fqdn].id
host_rule {
hosts = ["proxy.example.com"]
path_matcher = "matcher-proxy"
}
path_matcher {
name = "matcher-proxy"
default_service = google_compute_backend_service.proxy_backend[var.default_fqdn].id
dynamic "route_rules" {
# Key by zero-padded priority so iteration order matches priority order.
# GCP rejects non-ascending submission order. See the URL-map section above.
for_each = { for r in var.routes : format("%05d", r.priority) => r }
content {
priority = route_rules.value.priority
match_rules {
prefix_match = "/"
header_matches {
header_name = "X-Tenant-ID"
exact_match = route_rules.value.id
}
}
service = google_compute_backend_service.proxy_backend[route_rules.value.fqdn].id
}
}
}
}
The HTTPS target proxy, forwarding rule, static IP, and 80 → 443 redirect are not specific to this proxy. They belong to the load balancer itself and likely already exist if you are adding the proxy to an existing LB. If not, the HTTPS side looks like this:
resource "google_compute_global_address" "lb_ip" { name = "main-lb-ip" }
resource "google_compute_target_https_proxy" "main" {
name = "main-https-proxy"
url_map = google_compute_url_map.main.id
certificate_map = "//certificatemanager.googleapis.com/${google_certificate_manager_certificate_map.main.id}"
}
resource "google_compute_global_forwarding_rule" "https" {
name = "main-https-rule"
target = google_compute_target_https_proxy.main.id
port_range = "443"
ip_address = google_compute_global_address.lb_ip.address
load_balancing_scheme = "EXTERNAL_MANAGED"
}
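The 80 → 443 redirect is mentioned above but not part of that snippet, and the smoke test in the cutover playbook reads an lb_ip output that has to be declared somewhere. A minimal sketch of both follows, assuming the global address resource shown above; every resource name here (http_redirect, main-http-proxy, main-http-rule) is hypothetical and should be adapted to the existing load balancer.

```hcl
# A second URL map that does nothing except redirect every request to HTTPS.
resource "google_compute_url_map" "http_redirect" {
  name = "main-http-redirect"

  default_url_redirect {
    https_redirect         = true
    redirect_response_code = "MOVED_PERMANENTLY_DEFAULT"
    strip_query            = false
  }
}

resource "google_compute_target_http_proxy" "redirect" {
  name    = "main-http-proxy"
  url_map = google_compute_url_map.http_redirect.id
}

# Port 80 on the same static IP as the HTTPS forwarding rule.
resource "google_compute_global_forwarding_rule" "http" {
  name                  = "main-http-rule"
  target                = google_compute_target_http_proxy.redirect.id
  port_range            = "80"
  ip_address            = google_compute_global_address.lb_ip.address
  load_balancing_scheme = "EXTERNAL_MANAGED"
}

# Exposes the static IP so that `terraform output -raw lb_ip` works.
output "lb_ip" {
  value = google_compute_global_address.lb_ip.address
}
```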
Differences vs the nginx version
There are four meaningful behavioral changes. None block the migration, but each one is the kind of thing that bites a week later if you do not notice.
X-Real-IP is no longer set. The GCP LB adds X-Forwarded-For and X-Forwarded-Proto automatically, but never X-Real-IP. Upstreams that specifically read X-Real-IP (instead of falling back to the first hop of X-Forwarded-For) will see no client IP after the cutover. Grep the upstream code for X-Real-IP before you commit.
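If changing the upstream code is not practical, one possible workaround is to have the load balancer add the header itself. The sketch below assumes the backend service from the Terraform section and uses the load balancer's {client_ip_address} custom-header variable; whether the upstream accepts this as a drop-in for nginx's $remote_addr is something to verify rather than assume.

```hcl
resource "google_compute_backend_service" "proxy_backend" {
  for_each              = local.unique_fqdns
  name                  = "proxy-${replace(each.value, ".", "-")}"
  protocol              = "HTTPS"
  load_balancing_scheme = "EXTERNAL_MANAGED"

  # {client_ip_address} is expanded by the load balancer on each request.
  # Merge this entry with any headers already present in the list.
  custom_request_headers = ["X-Real-IP: {client_ip_address}"]

  backend {
    group = google_compute_global_network_endpoint_group.proxy_neg[each.value].id
  }

  depends_on = [google_compute_global_network_endpoint.proxy_endpoint]
}
```

The braces carry no leading dollar sign, so Terraform passes the string through untouched and the substitution happens at the load balancer.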
/healthcheck no longer returns 200 OK at the proxy. The external Application LB cannot synthesize a fixed-body response. A request to https://proxy.example.com/healthcheck falls through to default_service and hits whatever the default backend returns for that path. Any external uptime monitor, runbook step, or Kubernetes-style readiness probe hitting this URL will see different behavior. The fix is either to repoint the monitor at a real endpoint behind the proxy, or to add a tiny Cloud Run "ok" service as an extra backend and route /healthcheck to it explicitly.
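The second option could be sketched as follows. This assumes a Cloud Run service that answers 200 on every path already exists; the service name proxy-ok, the region variable, and the resource names are all hypothetical.

```hcl
variable "region" {
  type    = string
  default = "europe-west1" # assumption: wherever the Cloud Run service lives
}

# Serverless NEG pointing at the (hypothetical) Cloud Run service "proxy-ok".
resource "google_compute_region_network_endpoint_group" "ok_neg" {
  name                  = "proxy-ok-neg"
  region                = var.region
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    service = "proxy-ok"
  }
}

resource "google_compute_backend_service" "ok_backend" {
  name                  = "proxy-ok"
  load_balancing_scheme = "EXTERNAL_MANAGED"

  backend {
    group = google_compute_region_network_endpoint_group.ok_neg.id
  }
}
```

The matching rule goes inside path_matcher "matcher-proxy", ahead of the dynamic tenant rules. A priority of 10 is lower than every tenant priority in var.routes, so it is evaluated first and the rules are still submitted in ascending order:

```hcl
route_rules {
  priority = 10
  match_rules {
    # Prefix match mirrors nginx's `location /healthcheck`.
    prefix_match = "/healthcheck"
  }
  service = google_compute_backend_service.ok_backend.id
}
```

Because the rule carries no header_matches block, it answers regardless of X-Tenant-ID, which is how the nginx location behaved.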
Upstream TLS validation is stricter. nginx with proxy_pass https://... does not verify the upstream certificate by default. GCP backend services with HTTPS to Internet NEGs validate the upstream cert against the public trust store. If any upstream presents a self-signed or expired cert today, the LB handshake will fail where nginx silently succeeded. Confirm every upstream serves a publicly trusted cert before the cutover.
The client-facing TLS cert changes. Before: whatever cert-manager issued (often Let's Encrypt). After: a Google-managed cert. The cert chain and root CA differ. Any client that pins the certificate or its CA chain will start failing TLS validation. Pinning is rare but unevenly documented, so it is worth asking around if the consumers of this proxy are out of your control.
Two more changes that are usually cosmetic:
- The LB terminates HTTP/2 (and optionally HTTP/3) from the client by default; nginx terminated HTTP/1.1. Upstreams continue to see HTTP/1.1 or HTTP/2 depending on the backend service configuration. Almost never matters for proxy use cases.
- The source IP visible to the upstream is now a GCP egress address rather than a Kubernetes node address. The original client IP only survives via X-Forwarded-For. If any upstream has a source-IP allowlist, it needs updating to the new GCP egress range.
Cutover playbook
The standard four-step zero-downtime rollout.
1. Apply the Terraform. New cert provisioning takes around five minutes once DNS is ready. Confirm gcloud certificate-manager certificates describe cert-proxy shows state: ACTIVE. The LB does not respond correctly to the proxy hostname until this is green.

2. Smoke test by hand against the LB IP before changing public DNS:

   LB_IP=$(terraform output -raw lb_ip)
   for ID in T1 T2 T3 T4; do
     echo "=== $ID ==="
     curl -sk --resolve proxy.example.com:443:${LB_IP} \
       -H "X-Tenant-ID: ${ID}" \
       -o /dev/null -w "%{http_code} -> %{remote_ip}\n" \
       https://proxy.example.com/
   done

   Each request should hit the corresponding upstream, return 2xx (or whatever that backend's root path normally returns), and never the default fallback response.

3. Cut DNS. Point proxy.example.com at the LB static IP. Watch LB logs in Cloud Logging:

   resource.type="http_load_balancer"
   httpRequest.requestUrl=~"proxy.example.com"

   Sanity-check the routing distribution against the upstream selection logs on each backend.

4. Decommission the nginx pod. Once the LB has been serving production for at least one DNS TTL plus a comfortable margin and metrics look right, run helm uninstall proxy in the cluster, archive the chart repo, and drop the cert-manager Certificate resource and TLS secret.
If anything looks wrong during step 3, the rollback is to revert the DNS change. Nothing about the nginx-side state has been touched yet.
Adding, removing, or changing routes
After the cutover, the var.routes list is the only thing you edit to change routing. Everything downstream — NEGs, endpoints, backend services, URL map rules — is derived via for_each.
| Change | Plan output | Risk |
|---|---|---|
| Add entry with new fqdn | + neg, + endpoint, + backend, URL map update | None — additive |
| Add entry with fqdn already in another entry | URL map update only | None — additive |
| Remove entry, fqdn unique to it | - backend, - endpoint, - neg, URL map update | Removed tenant's traffic falls to default_service |
| Remove entry, fqdn shared | URL map update only | Same as above |
| Change priority | URL map update | None — header matches are mutually exclusive |
| Change id (header value) | URL map update | Old header value falls to default until clients update |
| Change fqdn | Possibly destroy old + create new infra, plus URL map update | Brief mid-apply window before the new NEG is resolvable |
terraform plan always tells the truth. If a plan shows destroys you did not expect, stop and re-read the diff before applying.
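Since every route change goes through var.routes, it may be worth catching malformed entries before they reach the plan. The sketch below adds validation blocks to the variable from the Terraform section (keep its existing default); the three rules and their limits are suggestions, not requirements stated by GCP. The 99999 ceiling exists only because the for_each key is formatted with %05d.

```hcl
variable "routes" {
  type = list(object({
    id       = string
    fqdn     = string
    priority = number
  }))

  # Duplicate priorities would collide in the format("%05d", ...) map key.
  validation {
    condition     = length(distinct([for r in var.routes : r.priority])) == length(var.routes)
    error_message = "Each route needs a unique priority."
  }

  # Two rules for the same header value would make one of them unreachable.
  validation {
    condition     = length(distinct([for r in var.routes : r.id])) == length(var.routes)
    error_message = "Each X-Tenant-ID value may appear only once in var.routes."
  }

  # Zero-padding to five digits only sorts correctly for whole numbers up to 99999.
  validation {
    condition = alltrue([
      for r in var.routes :
      r.priority >= 0 && r.priority <= 99999 && floor(r.priority) == r.priority
    ])
    error_message = "Priorities must be whole numbers between 0 and 99999."
  }
}
```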
Debugging
The four useful starting points when something is not working as expected.
Cert not active. Managed certs need the domain to already resolve to the LB's static IP so the ACME challenge can succeed.
gcloud certificate-manager certificates describe cert-proxy
If state is PROVISIONING, fix DNS and wait. If it stays stuck, the cert manager's status block names the failing domain.
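If waiting for DNS to point at the LB is not acceptable, for example because the cert should be ACTIVE before the cutover, Certificate Manager also supports DNS authorization, which validates the domain through a CNAME record instead. A minimal sketch, assuming the certificate from the Terraform section; the names proxy_auth and dnsauth-proxy are hypothetical, and the CNAME still has to be created in whichever system hosts the example.com zone.

```hcl
resource "google_certificate_manager_dns_authorization" "proxy_auth" {
  name   = "dnsauth-proxy"
  domain = "proxy.example.com"
}

resource "google_certificate_manager_certificate" "proxy_cert" {
  name = "cert-proxy"

  managed {
    domains            = ["proxy.example.com"]
    dns_authorizations = [google_certificate_manager_dns_authorization.proxy_auth.id]
  }
}

# The validation record (name, type, data) to publish in the DNS zone.
output "proxy_cert_dns_record" {
  value = google_certificate_manager_dns_authorization.proxy_auth.dns_resource_record[0]
}
```

With the record in place the certificate can be issued while proxy.example.com still resolves to the old nginx ingress.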
Wrong backend selected. Hit the LB with curl --resolve to bypass DNS, then inspect the LB request log:
gcloud logging read 'resource.type="http_load_balancer"
AND httpRequest.requestUrl=~"proxy.example.com"' \
--project=<project> --limit=20 --format=json \
| jq '.[].resource.labels.backend_service_name'
The backend_service_name resource label tells you which backend the URL map chose. If it does not match the header value sent, double-check the header_matches exact_match string for case and trailing whitespace.
Upstream TLS failures. The LB logs include a statusDetails field with values like backend_connection_closed_before_data_sent_to_client, failed_to_pick_backend, or client_disconnected_before_any_response. For TLS issues specifically, look for log entries with the same backend service but no response code:
gcloud logging read 'resource.type="http_load_balancer"
AND resource.labels.backend_service_name="proxy-<fqdn-with-dots-replaced-by-dashes>"
AND httpRequest.status=0' \
--project=<project> --freshness=1h --limit=10
Then reproduce locally to confirm:
openssl s_client -connect <fqdn>:443 -servername <fqdn> < /dev/null
If the cert chain is incomplete or untrusted, you have found the cause.
Unexpected 404s. Almost always a stale URL map. Force a plan/apply and make sure the host_rule's hosts list and the path_matcher's name still match each other.
What it costs
A global external Application LB has a base cost of around $18 per month plus per-request and per-GB charges. Internet NEG traffic counts as internet egress on the LB's project, not intra-region — for a stage backend in a different project, you are paying public-internet egress rates even though both endpoints are on GCP. At the kinds of request volumes a tenant routing proxy typically sees this is small money, but it is worth knowing before quoting numbers to anyone.