Replacing an L7 header-routing nginx proxy with a GCP Application Load Balancer

Migrating a Kubernetes-deployed nginx proxy that dispatches HTTPS traffic by header value to a GCP global external Application Load Balancer using Internet NEGs and URL-map header matches.

June 9, 2026 · kubernetes · 2321 words · 12 min read
#gcp #load-balancer #nginx #terraform #kubernetes #aks

What we are replacing

The starting point is a tiny nginx pod running on Azure Kubernetes Service. Its job is to front a single public hostname and forward each request to one of several upstream HTTPS hosts, chosen by the value of a custom header. A condensed version of the ConfigMap:

map $http_x_tenant_id $backend {
  "T1" tenant1-admin.dev.example.com;
  "T2" tenant2-admin.dev.example.com;
  "T3" admin.partner.com;
  "T4" tenant4-admin.prod.example.com;
  default tenant1-admin.dev.example.com;
}

server {
  listen 80;

  location / {
    proxy_set_header Host $backend;
    proxy_pass https://$backend;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_ssl_server_name on;
    proxy_ssl_name $backend;
  }

  location /healthcheck {
    return 200 "OK";
  }
}

An nginx-ingress in front terminates TLS for proxy.example.com. The pod itself just listens on plain HTTP and rewrites the Host header as it proxies. Useful properties that are easy to take for granted:

  • TLS termination and re-encryption to the upstream.
  • SNI to the upstream set to the upstream hostname (proxy_ssl_server_name on).
  • Host header rewriting so the upstream sees a Host it recognises.
  • A fixed 200 OK response on /healthcheck for liveness/uptime monitoring.
  • An implicit default route that catches missing or unknown header values.

All of these have GCP-native equivalents on the global external Application Load Balancer except one: /healthcheck, which we will need to either drop or replace.

The target topology

client -> proxy.example.com -> GCP LB (TLS terminated)
   -> URL map (matcher-proxy)
      -> headerMatches: X-Tenant-ID == "T1" -> backend_service[tenant1-admin.dev.example.com]
      -> headerMatches: X-Tenant-ID == "T2" -> backend_service[tenant2-admin.dev.example.com]
      -> headerMatches: X-Tenant-ID == "T3" -> backend_service[admin.partner.com]
      -> headerMatches: X-Tenant-ID == "T4" -> backend_service[tenant4-admin.prod.example.com]
      -> default_service                       -> backend_service[tenant1-admin.dev.example.com]
         -> Internet NEG (INTERNET_FQDN_PORT, port 443)
            -> upstream over public HTTPS

Three GCP primitives do the work:

  1. Internet NEG of type INTERNET_FQDN_PORT — represents one upstream FQDN on a fixed port. GCP resolves the FQDN at request time and uses it as the SNI server name for the outbound TLS handshake.
  2. Backend service with protocol = HTTPS wraps an Internet NEG. It is the unit the URL map points at.
  3. URL map with route_rules whose match_rules.header_matches block provides header-based dispatch. Unmatched requests fall to path_matcher.default_service.

A managed cert plus a certificate-map entry handles TLS termination at the LB.

Internet NEGs and SNI

INTERNET_FQDN_PORT is the Internet NEG type that lets a global Application LB reach an arbitrary public host by name (its sibling, INTERNET_IP_PORT, takes a fixed IP address instead). The endpoint is just a (fqdn, port) tuple. At request time GCP resolves the fqdn via public DNS, opens an HTTPS connection to one of the addresses, and sets the TLS SNI server name to the fqdn.

This replaces three nginx directives at once:

| nginx | GCP behavior |
|---|---|
| proxy_pass https://$backend | Backend service of type HTTPS dialing the Internet NEG |
| proxy_ssl_server_name on | SNI is on by default for HTTPS Internet NEGs |
| proxy_ssl_name $backend | SNI value is the NEG endpoint's fqdn |

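The Host header is the exception. By default the LB forwards the client's Host header (proxy.example.com) to the upstream unchanged, so proxy_set_header Host $backend needs an explicit equivalent: a custom request header on each backend service, e.g. custom_request_headers = ["Host: <fqdn>"] in Terraform (customRequestHeaders in the API).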

URL map header-based routing

google_compute_url_map supports advanced route_rules whose match_rules.header_matches block compares a named header against an exact value (or a regex, prefix, suffix, presence, or range). Each rule gets a unique priority within its path_matcher; lower numbers win.

path_matcher {
  name            = "matcher-proxy"
  default_service = google_compute_backend_service.proxy_backend[var.default_fqdn].id

  # GCP requires route_rules to be SUBMITTED in priority-ascending order.
  # Terraform iterates maps alphabetically by key, so key by zero-padded
  # priority to force the iteration order to match the priority order.
  dynamic "route_rules" {
    for_each = { for r in var.routes : format("%05d", r.priority) => r }
    content {
      priority = route_rules.value.priority
      match_rules {
        prefix_match = "/"
        header_matches {
          header_name = "X-Tenant-ID"
          exact_match = route_rules.value.id
        }
      }
      service = google_compute_backend_service.proxy_backend[route_rules.value.fqdn].id
    }
  }
}

Three subtleties:

  • match_rules requires at least one path-matching field. prefix_match = "/" matches everything, leaving the routing decision purely on the header.
  • Header name matching is case-insensitive (HTTP standard). Header value matching with exact_match is case-sensitive — match exactly the string the client sends.
  • Iteration order matters. If the rules arrive in non-ascending priority order, GCP rejects the API call with "Invalid value for field 'resource.pathMatchers[N].routeRules[M].priority': 'X'. Within a pathMatcher, a route rule must have a priority that's higher than the previous route rule's priority". Terraform iterates maps in lexical key order, so naively keying by the route id (e.g. "FA", "KG", "T1") produces a submission order that has nothing to do with priority. Keying by zero-padded priority (format("%05d", r.priority) → "00100", "00110", …) forces lexical iteration to match priority order. The actual priority attribute on each rule is still set from route_rules.value.priority; only the iteration order changes.

When no route_rules match (header missing, header value not in the table), the request falls through to default_service. This is the GCP equivalent of nginx's default map key.

Terraform translation

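A working example. var.routes and var.default_fqdn capture the nginx map; everything else is derived. It assumes the load balancer's certificate map, google_certificate_manager_certificate_map.main, is defined elsewhere.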

variable "routes" {
  type = list(object({
    id       = string
    fqdn     = string
    priority = number
  }))
  default = [
    { id = "T1", fqdn = "tenant1-admin.dev.example.com",   priority = 100 },
    { id = "T2", fqdn = "tenant2-admin.dev.example.com",   priority = 110 },
    { id = "T3", fqdn = "admin.partner.com",               priority = 120 },
    { id = "T4", fqdn = "tenant4-admin.prod.example.com",  priority = 130 },
  ]
}

variable "default_fqdn" {
  type    = string
  default = "tenant1-admin.dev.example.com"
}

locals {
  # Unique FQDNs across the routing table — one NEG / backend per fqdn.
  unique_fqdns = toset(concat([for r in var.routes : r.fqdn], [var.default_fqdn]))
}

# Internet NEG per unique upstream FQDN.
resource "google_compute_global_network_endpoint_group" "proxy_neg" {
  for_each              = local.unique_fqdns
  name                  = "proxy-neg-${replace(each.value, ".", "-")}"
  network_endpoint_type = "INTERNET_FQDN_PORT"
  default_port          = 443
}

resource "google_compute_global_network_endpoint" "proxy_endpoint" {
  for_each                      = local.unique_fqdns
  global_network_endpoint_group = google_compute_global_network_endpoint_group.proxy_neg[each.value].name
  fqdn                          = each.value
  port                          = 443
}

# Backend service per unique upstream FQDN.
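resource "google_compute_backend_service" "proxy_backend" {
  for_each              = local.unique_fqdns
  name                  = "proxy-${replace(each.value, ".", "-")}"
  protocol              = "HTTPS"
  load_balancing_scheme = "EXTERNAL_MANAGED"

  # Equivalent of nginx's proxy_set_header Host $backend. Without this the
  # upstream receives the client's Host header (proxy.example.com).
  custom_request_headers = ["Host: ${each.value}"]

  backend {
    group = google_compute_global_network_endpoint_group.proxy_neg[each.value].id
  }

  # The NEG needs its endpoint attached before the backend can serve traffic.
  depends_on = [google_compute_global_network_endpoint.proxy_endpoint]
}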

# Managed cert + cert-map entry for the public hostname.
resource "google_certificate_manager_certificate" "proxy_cert" {
  name = "cert-proxy"
  managed {
    domains = ["proxy.example.com"]
  }
}

resource "google_certificate_manager_certificate_map_entry" "proxy_entry" {
  name         = "entry-proxy"
  map          = google_certificate_manager_certificate_map.main.name
  certificates = [google_certificate_manager_certificate.proxy_cert.id]
  hostname     = "proxy.example.com"
}

# URL map: one host_rule + one path_matcher; route_rules generated from var.routes.
resource "google_compute_url_map" "main" {
  name            = "main-url-map"
  default_service = google_compute_backend_service.proxy_backend[var.default_fqdn].id

  host_rule {
    hosts        = ["proxy.example.com"]
    path_matcher = "matcher-proxy"
  }

  path_matcher {
    name            = "matcher-proxy"
    default_service = google_compute_backend_service.proxy_backend[var.default_fqdn].id

    dynamic "route_rules" {
      # Key by zero-padded priority so iteration order matches priority order.
      # GCP rejects non-ascending submission order. See the URL-map section above.
      for_each = { for r in var.routes : format("%05d", r.priority) => r }
      content {
        priority = route_rules.value.priority
        match_rules {
          prefix_match = "/"
          header_matches {
            header_name = "X-Tenant-ID"
            exact_match = route_rules.value.id
          }
        }
        service = google_compute_backend_service.proxy_backend[route_rules.value.fqdn].id
      }
    }
  }
}

The HTTPS target proxy, forwarding rule, static IP, and 80 → 443 redirect are not specific to this proxy. They belong to the load balancer itself and likely already exist if you are adding the proxy to an existing LB. If not, the HTTPS side looks like this:

resource "google_compute_global_address" "lb_ip" { name = "main-lb-ip" }

resource "google_compute_target_https_proxy" "main" {
  name            = "main-https-proxy"
  url_map         = google_compute_url_map.main.id
  certificate_map = "//certificatemanager.googleapis.com/${google_certificate_manager_certificate_map.main.id}"
}

resource "google_compute_global_forwarding_rule" "https" {
  name                  = "main-https-rule"
  target                = google_compute_target_https_proxy.main.id
  port_range            = "443"
  ip_address            = google_compute_global_address.lb_ip.address
  load_balancing_scheme = "EXTERNAL_MANAGED"
}
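
The listing above stops at the HTTPS forwarding rule. The following is a minimal sketch of the pieces it references but does not define: the certificate map, the port 80 redirect, and the lb_ip output that the smoke test reads. All resource names here (main-cert-map, main-http-redirect, main-http-proxy, main-http-rule) are hypothetical; adjust them to whatever your load balancer already uses.

```hcl
# Assumed definition of the certificate map referenced elsewhere as
# google_certificate_manager_certificate_map.main.
resource "google_certificate_manager_certificate_map" "main" {
  name = "main-cert-map" # hypothetical name
}

# URL map whose only job is to redirect plain HTTP to HTTPS.
resource "google_compute_url_map" "http_redirect" {
  name = "main-http-redirect" # hypothetical name

  default_url_redirect {
    https_redirect         = true
    redirect_response_code = "MOVED_PERMANENTLY_DEFAULT"
    strip_query            = false
  }
}

resource "google_compute_target_http_proxy" "redirect" {
  name    = "main-http-proxy" # hypothetical name
  url_map = google_compute_url_map.http_redirect.id
}

# Port 80 listener on the same static IP as the HTTPS rule.
resource "google_compute_global_forwarding_rule" "http" {
  name                  = "main-http-rule" # hypothetical name
  target                = google_compute_target_http_proxy.redirect.id
  port_range            = "80"
  ip_address            = google_compute_global_address.lb_ip.address
  load_balancing_scheme = "EXTERNAL_MANAGED"
}

# Read by `terraform output -raw lb_ip` in the cutover smoke test.
output "lb_ip" {
  value = google_compute_global_address.lb_ip.address
}
```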

Differences vs the nginx version

There are four meaningful behavioral changes. None block the migration, but each one is the kind of thing that bites a week later if you do not notice.

X-Real-IP is no longer set. The GCP LB adds X-Forwarded-For and X-Forwarded-Proto automatically, but never X-Real-IP. Upstreams that specifically read X-Real-IP (instead of falling back to the first hop of X-Forwarded-For) will see no client IP after the cutover. Grep the upstream code for X-Real-IP before you commit.
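
If an upstream cannot be changed, one option is to have the LB add the header itself through a custom request header. The fragment below is a sketch of the extra attribute on the proxy_backend resource, assuming the {client_ip_address} custom-header variable is available for your load balancer type; every other argument stays as in the Terraform translation.

```hcl
resource "google_compute_backend_service" "proxy_backend" {
  # ... existing arguments from the Terraform translation stay unchanged ...

  # {client_ip_address} is expanded by the LB on each request. If the
  # resource already sets custom_request_headers, append this entry to
  # that list instead of declaring the attribute twice.
  custom_request_headers = [
    "X-Real-IP: {client_ip_address}",
  ]
}
```

This reproduces the header name only. Whether the value matches what nginx used to send depends on what sat in front of the pod, so compare the two on a test request before relying on it.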

/healthcheck no longer returns 200 OK at the proxy. The external Application LB cannot synthesize a fixed-body response. A request to https://proxy.example.com/healthcheck falls through to default_service and hits whatever the default backend returns for that path. Any external uptime monitor, runbook step, or Kubernetes-style readiness probe hitting this URL will see different behavior. The fix is either to repoint the monitor at a real endpoint behind the proxy, or to add a tiny Cloud Run "ok" service as an extra backend and route /healthcheck to it explicitly.
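
The Cloud Run option could be wired up roughly as follows. This is a sketch under assumptions: a Cloud Run service named proxy-ok that answers 200 "OK" already exists in var.region, and both that name and the variable are hypothetical. The route rule uses priority 10 so that it sorts ahead of the tenant rules, which start at 100.

```hcl
variable "region" {
  type        = string
  description = "Region of the hypothetical proxy-ok Cloud Run service"
}

# Serverless NEG pointing at the assumed Cloud Run service.
resource "google_compute_region_network_endpoint_group" "ok_neg" {
  name                  = "proxy-ok-neg"
  region                = var.region
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    service = "proxy-ok"
  }
}

resource "google_compute_backend_service" "ok_backend" {
  name                  = "proxy-ok"
  load_balancing_scheme = "EXTERNAL_MANAGED"

  backend {
    group = google_compute_region_network_endpoint_group.ok_neg.id
  }
}

resource "google_compute_url_map" "main" {
  # ... existing arguments and host_rule stay unchanged ...

  path_matcher {
    # ... existing name and default_service stay unchanged ...

    # Placed above the dynamic "route_rules" block so it is submitted first.
    route_rules {
      priority = 10
      match_rules {
        full_path_match = "/healthcheck"
      }
      service = google_compute_backend_service.ok_backend.id
    }

    # ... existing dynamic "route_rules" block follows here ...
  }
}
```

Note that full_path_match is an exact match, whereas nginx's location /healthcheck also matched longer paths with that prefix. Use prefix_match instead if monitors rely on that.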

Upstream TLS validation is stricter. nginx with proxy_pass https://... does not verify the upstream certificate by default. GCP backend services with HTTPS to Internet NEGs validate the upstream cert against the public trust store. If any upstream presents a self-signed or expired cert today, the LB handshake will fail where nginx silently succeeded. Confirm every upstream serves a publicly trusted cert before the cutover.

The client-facing TLS cert changes. Before: whatever cert-manager issued (often Let's Encrypt). After: a Google-managed cert. The cert chain and root CA differ. Any client that pins the certificate or its CA chain will start failing TLS validation. Pinning is rare but unevenly documented, so it is worth asking around if the consumers of this proxy are out of your control.

Two more changes that are usually cosmetic:

  • The LB terminates HTTP/2 (and optionally HTTP/3) from the client by default; nginx terminated HTTP/1.1. Upstreams continue to see HTTP/1.1 or HTTP/2 depending on the backend service configuration. Almost never matters for proxy use cases.
  • The source IP visible to the upstream is now a GCP egress address rather than a Kubernetes node address. The original client IP only survives via X-Forwarded-For. If any upstream has a source-IP allowlist, it needs updating to the new GCP egress ranges, which Google publishes in the _cloud-eoips.googleusercontent.com DNS TXT record.

Cutover playbook

The standard four-step zero-downtime rollout.

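  1. Apply the Terraform. Cert provisioning takes around five minutes once the domain can be validated. With the managed block as written (load-balancer authorization), validation only succeeds after proxy.example.com resolves to the LB, so for a cert that is ACTIVE before the DNS cut, issue it with a Certificate Manager DNS authorization instead. Confirm gcloud certificate-manager certificates describe cert-proxy shows state: ACTIVE. The LB cannot complete a TLS handshake for the proxy hostname until it does.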

  2. Smoke test by hand against the LB IP before changing public DNS:

    LB_IP=$(terraform output -raw lb_ip)
    for ID in T1 T2 T3 T4; do
      echo "=== $ID ==="
      curl -sk --resolve proxy.example.com:443:${LB_IP} \
        -H "X-Tenant-ID: ${ID}" \
        -o /dev/null -w "%{http_code} -> %{remote_ip}\n" \
        https://proxy.example.com/
    done
    

    Each request should return 2xx (or whatever that backend's root path normally returns). The status code alone does not prove which upstream answered, and T1 is also the default route, so confirm the selected backend in the LB request log before moving on.

  3. Cut DNS. Point proxy.example.com at the LB static IP. Watch LB logs in Cloud Logging:

    resource.type="http_load_balancer"
    httpRequest.requestUrl=~"proxy.example.com"
    

    Sanity-check the routing distribution against the upstream selection logs on each backend.

  4. Decommission the nginx pod. Once the LB has been serving production for at least one DNS TTL plus a comfortable margin and metrics look right: helm uninstall proxy in the cluster, archive the chart repo, drop the cert-manager Certificate resource and TLS secret.

If anything looks wrong during step 3, the rollback is to revert the DNS change. Nothing about the nginx-side state has been touched yet.

Adding, removing, or changing routes

After the cutover, the var.routes list is the only thing you edit to change routing. Everything downstream — NEGs, endpoints, backend services, URL map rules — is derived via for_each.

| Change | Plan output | Risk |
|---|---|---|
| Add entry with new fqdn | + neg, + endpoint, + backend, URL map update | None (additive) |
| Add entry with fqdn already in another entry | URL map update only | None (additive) |
| Remove entry, fqdn unique to it | - backend, - endpoint, - neg, URL map update | Removed tenant's traffic falls to default_service |
| Remove entry, fqdn shared | URL map update only | Same as above |
| Change priority | URL map update | None (header matches are mutually exclusive) |
| Change id (header value) | URL map update | Old header value falls to default until clients update |
| Change fqdn | Possibly destroy old + create new infra, plus URL map update | Brief mid-apply window before the new NEG is resolvable |

terraform plan always tells the truth. If a plan shows destroys you did not expect, stop and re-read the diff before applying.
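
Because the whole routing table hangs off var.routes, it can help to reject bad edits before they reach a plan. The sketch below adds validation blocks to the variable; the specific rules are my assumptions about what is worth guarding (unique priorities, unique ids, and priorities that fit the five-digit zero-padding used as the for_each key), not something the original configuration enforces.

```hcl
variable "routes" {
  type = list(object({
    id       = string
    fqdn     = string
    priority = number
  }))

  # Keep the existing default = [...] list here; omitted for brevity.

  validation {
    # Duplicate priorities would collide as for_each keys and as rule priorities.
    condition     = length(distinct([for r in var.routes : r.priority])) == length(var.routes)
    error_message = "Each route needs a unique priority."
  }

  validation {
    # Two rules matching the same header value would make one of them dead.
    condition     = length(distinct([for r in var.routes : r.id])) == length(var.routes)
    error_message = "Each route needs a unique id (X-Tenant-ID value)."
  }

  validation {
    # format("%05d", ...) only sorts correctly for whole numbers of at most five digits.
    condition = alltrue([
      for r in var.routes :
      r.priority >= 0 && r.priority <= 99999 && floor(r.priority) == r.priority
    ])
    error_message = "Priorities must be whole numbers between 0 and 99999."
  }
}
```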

Debugging

The four useful starting points when something is not working as expected.

Cert not active. With load-balancer authorization (a managed block that lists only domains, as in the example), the domain must already resolve to the LB's static IP so the ACME challenge can succeed.

gcloud certificate-manager certificates describe cert-proxy

If state is PROVISIONING, fix DNS and wait. If it stays stuck, the cert manager's status block names the failing domain.
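
When the cert has to be ACTIVE before the hostname points at the LB, which is what the cutover playbook wants, DNS authorization is the usual alternative: Google validates a dedicated CNAME record instead of the live A record. This is a sketch of that variant of the cert-proxy resource; the authorization name dnsauth-proxy and the output name are hypothetical.

```hcl
resource "google_certificate_manager_dns_authorization" "proxy" {
  name   = "dnsauth-proxy" # hypothetical name
  domain = "proxy.example.com"
}

# Same certificate as in the Terraform translation, now tied to the
# DNS authorization instead of load-balancer authorization.
resource "google_certificate_manager_certificate" "proxy_cert" {
  name = "cert-proxy"
  managed {
    domains            = ["proxy.example.com"]
    dns_authorizations = [google_certificate_manager_dns_authorization.proxy.id]
  }
}

# The CNAME record (name, type, data) to create in the zone that hosts
# proxy.example.com, in whichever DNS provider serves it.
output "proxy_cert_dns_record" {
  value = google_certificate_manager_dns_authorization.proxy.dns_resource_record
}
```

The validation CNAME lives alongside the existing records, so it can be created well before the cutover without affecting traffic to the nginx proxy.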

Wrong backend selected. Hit the LB with curl --resolve to bypass DNS, then inspect the LB request log:

gcloud logging read 'resource.type="http_load_balancer"
  AND httpRequest.requestUrl=~"proxy.example.com"' \
  --project=<project> --limit=20 --format=json \
  | jq '.[].resource.labels.backend_service_name'

The backend_service_name resource label tells you which backend the URL map chose. If it does not match the header value sent, double-check the header_matches exact_match string for case and trailing whitespace.
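
If the query returns nothing at all, check that request logging is enabled: it is configured per backend service and may be off for services created through Terraform. A sketch of the extra block on the proxy_backend resource, with a sample rate of 1.0 chosen here as an assumption that suits low-volume debugging:

```hcl
resource "google_compute_backend_service" "proxy_backend" {
  # ... existing arguments from the Terraform translation stay unchanged ...

  # Log every request handled by this backend service. Lower sample_rate
  # once the migration has settled if log volume becomes a concern.
  log_config {
    enable      = true
    sample_rate = 1.0
  }
}
```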

Upstream TLS failures. The LB logs include a statusDetails field with values like backend_connection_closed_before_data_sent_to_client, failed_to_pick_backend, or client_disconnected_before_any_response. A failed TLS handshake with the upstream usually reaches the client as a 502, so filter for 5xx entries on the affected backend service and read their statusDetails. In the filter, <fqdn-with-dashes> is the fqdn with dots replaced by dashes, matching the Terraform name:

gcloud logging read 'resource.type="http_load_balancer"
  AND resource.labels.backend_service_name="proxy-<fqdn-with-dashes>"
  AND httpRequest.status>=500' \
  --project=<project> --freshness=1h --limit=10

Then reproduce locally to confirm:

openssl s_client -connect <fqdn>:443 -servername <fqdn> < /dev/null

If the cert chain is incomplete or untrusted, you have found the cause.

Unexpected 404s. Almost always a stale URL map. Force a plan/apply and make sure the host_rule's hosts list and the path_matcher's name still match each other.

What it costs

A global external Application LB has a base cost of around $18 per month plus per-request and per-GB charges. Internet NEG traffic counts as internet egress on the LB's project, not intra-region — for a stage backend in a different project, you are paying public-internet egress rates even though both endpoints are on GCP. At the kinds of request volumes a tenant routing proxy typically sees this is small money, but it is worth knowing before quoting numbers to anyone.