Building a Serverless Certificate Authority with AWS KMS

SPIRE requires a persistent daemon — incompatible with serverless. We built a Lambda-based CA with signing keys in KMS that issues SPIFFE SVIDs in under 100ms. Here is the design.

Issuing cryptographic identities for AI agents requires a Certificate Authority. The CA private key is the most sensitive secret in the system: whoever controls it can issue valid credentials for any agent in any trust domain. Getting the CA design right is not optional.

We evaluated three approaches: SPIRE (the CNCF reference implementation), HashiCorp Vault PKI, and a custom Lambda-based CA backed by AWS KMS. We chose the third. Here is the reasoning and the implementation.

Why SPIRE does not fit a serverless architecture

SPIRE works by running two components: a Server and an Agent. The Server is the CA. The Agent runs on each node and attests workloads by inspecting local state (kernel metadata, container runtime, cloud instance identity documents). Once attested, workloads receive an SVID from the Server.

The problem is structural: both components are persistent processes. In our architecture, every backend component runs as an AWS Lambda function. Lambda invocations are ephemeral — they spin up in ~10ms, execute, and terminate. There is no persistent process to run the SPIRE Agent. There is no persistent node for the SPIRE Server.

Even if we ran a SPIRE Server on ECS or EC2, we would introduce a persistent server into an otherwise serverless architecture — breaking the zero-maintenance, scale-to-zero properties we depend on. We needed a CA that is itself serverless.

Why Vault PKI has the same structural problem

HashiCorp Vault is excellent software. Vault PKI can issue X.509 certificates with SPIFFE-format URIs. But Vault is also a persistent service. Running highly available Vault in AWS requires, at minimum, an ECS Fargate cluster with a Raft backend, auto-unseal via KMS, and careful availability-zone design. That is significant operational overhead for a team of two.

The KMS Lambda CA design

AWS KMS supports asymmetric key pairs in several key specs, including RSA_2048, RSA_4096, ECC_NIST_P256, and ECC_NIST_P521. The private key is generated inside KMS HSMs and cannot be exported. All signing operations happen inside KMS via the kms:Sign API call, and CloudTrail records each one. The public key can be exported (kms:GetPublicKey) so that anyone can verify the certificates the CA issues.
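Because KMS exports the public key as an ordinary DER-encoded SubjectPublicKeyInfo, a signature produced by kms:Sign can be checked with standard library code and no AWS dependency. The sketch below illustrates this for RSASSA-PKCS1-v1_5 with SHA-256. Both function names are hypothetical, and `localSignDemo` uses a throwaway in-memory key purely as a stand-in for KMS so that the example runs offline.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"errors"
)

// verifyWithExportedKey checks an RSASSA-PKCS1-v1_5 / SHA-256 signature
// against a DER-encoded SubjectPublicKeyInfo.
func verifyWithExportedKey(spkiDER, message, signature []byte) error {
	pub, err := x509.ParsePKIXPublicKey(spkiDER)
	if err != nil {
		return err
	}
	rsaPub, ok := pub.(*rsa.PublicKey)
	if !ok {
		return errors.New("not an RSA public key")
	}
	digest := sha256.Sum256(message)
	return rsa.VerifyPKCS1v15(rsaPub, crypto.SHA256, digest[:], signature)
}

// localSignDemo is a stand-in for KMS (assumption: illustration only).
// It signs message with a throwaway local key and returns the key's
// SubjectPublicKeyInfo together with the signature.
func localSignDemo(message []byte) (spkiDER, signature []byte, err error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	spkiDER, err = x509.MarshalPKIXPublicKey(&key.PublicKey)
	if err != nil {
		return nil, nil, err
	}
	digest := sha256.Sum256(message)
	signature, err = rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	return spkiDER, signature, err
}
```

In a real deployment the first argument would be the bytes returned by kms:GetPublicKey, and the signature would come from kms:Sign.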

Our CA is a Lambda function that:

  • Receives an SVID issuance request with agent metadata (name, org, team, scope, TTL)
  • Generates a certificate template with the SPIFFE URI as the Subject Alternative Name
  • Calls kms:Sign with the SHA-256 digest of the certificate's TBS (to-be-signed) bytes and the CA key ARN
  • Returns the signed X.509 certificate — the SVID
```go
// adapters/kms/ca.go (simplified)
import (
    "context"
    "crypto/sha256"
    "crypto/x509"
    "crypto/x509/pkix"
    "fmt"
    "net/url"
    "time"

    "github.com/aws/aws-sdk-go-v2/service/kms"
    "github.com/aws/aws-sdk-go-v2/service/kms/types"
)

func (ca *KMSCertificateAuthority) IssueSVID(
    ctx context.Context,
    req IssueSVIDRequest,
) (*x509.Certificate, error) {
    // Build SPIFFE URI: spiffe://trustwarden/{org}/{team}/{name}/{instance}
    spiffeURI := fmt.Sprintf(
        "spiffe://trustwarden/%s/%s/%s/%s",
        req.Org, req.Team, req.Name, req.InstanceID,
    )

    // Build X.509 template; read the clock once so the validity window
    // is exactly req.TTL long.
    now := time.Now()
    template := &x509.Certificate{
        SerialNumber: generateSerialNumber(),
        Subject:      pkix.Name{CommonName: req.Name},
        URIs:         []*url.URL{must(url.Parse(spiffeURI))},
        NotBefore:    now,
        NotAfter:     now.Add(req.TTL),
        KeyUsage:     x509.KeyUsageDigitalSignature,
        ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
    }

    // Sign via KMS; the private key never leaves the HSM.
    // MessageTypeDigest tells KMS the message is already hashed, so we send
    // the SHA-256 digest of the TBS bytes, not the TBS bytes themselves.
    tbsBytes := buildTBSCertificate(template, ca.publicKey)
    digest := sha256.Sum256(tbsBytes)
    signOutput, err := ca.kmsClient.Sign(ctx, &kms.SignInput{
        KeyId:            &ca.keyARN,
        Message:          digest[:],
        MessageType:      types.MessageTypeDigest,
        SigningAlgorithm: types.SigningAlgorithmSpecRsassaPkcs1V15Sha256,
    })
    if err != nil {
        return nil, fmt.Errorf("kms.Sign: %w", err)
    }

    return assembleCertificate(template, signOutput.Signature, ca.caCert)
}
```

Performance: under 100ms p99

Identity issuance is on the critical path for every agent instantiation. A slow CA blocks every AgentIdentity.create() call. We set a hard requirement: p99 must be under 100ms, including Lambda cold start.

Go on provided.al2023 starts in approximately 8–12ms (compiled binary, no interpreter). The kms:Sign call in us-east-1 runs at 15–25ms p50, 35–45ms p99. DynamoDB writes for the Trust Registry run at 5–10ms p50. Total issuance: 30–70ms p50, 85–95ms p99. We meet the requirement with margin.

Why Go over Python or Node.js for Lambda?

Python Lambda cold starts range from 200–800ms depending on dependency count. Node.js is similar. Go compiles to a single static binary with no interpreter or separate language runtime to boot, so cold start stays around 10ms and is largely insensitive to how many packages you import. For an identity issuance function on the critical path, the choice is easy.

Security properties

  • The CA private key is generated in KMS HSMs and never extracted under any circumstance — not even to us.
  • Every signing operation is recorded in CloudTrail with the caller identity, key ARN, and timestamp.
  • The Lambda execution role has kms:Sign permission for the CA key only — least privilege enforced via IAM.
  • SVIDs are short-lived by design (configurable TTL, default 1h). Compromise of a certificate is time-bounded.
  • Certificate serial numbers are random 128-bit values — no enumeration attacks.
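The random serial number property can be sketched with the standard library alone. `newSerialNumber` below is a hypothetical helper, not necessarily the `generateSerialNumber` used in the CA. It draws from a cryptographically secure source and keeps the value strictly positive, as X.509 requires.

```go
package main

import (
	"crypto/rand"
	"math/big"
)

// newSerialNumber returns a uniformly random serial in [1, 2^128 - 1].
func newSerialNumber() (*big.Int, error) {
	// upper = 2^128 - 1, so rand.Int yields a value in [0, 2^128 - 2].
	upper := new(big.Int).Lsh(big.NewInt(1), 128)
	upper.Sub(upper, big.NewInt(1))

	n, err := rand.Int(rand.Reader, upper)
	if err != nil {
		return nil, err
	}
	// Shift the range up by one so the serial is never zero.
	return n.Add(n, big.NewInt(1)), nil
}
```

A 128-bit value fits comfortably within the 20-octet serial number limit in RFC 5280.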

SPIFFE format preserved

We chose not to use SPIRE, but we preserved the SPIFFE URI format. This is deliberate. SPIFFE is an open CNCF standard with growing adoption across service meshes (Istio, Envoy, Linkerd) and cloud providers (AWS IAM Roles Anywhere, GCP Workload Identity). Issuing SPIFFE-format SVIDs means TrustWarden agents can be verified by any SPIFFE-aware system, not just ours.
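On the verifying side, a relying party can recover the agent's attributes from the URI SAN with ordinary URL parsing. The sketch below assumes the four-segment path layout shown in the issuance code (org, team, name, instance). The `agentID` type and `parseAgentID` function are hypothetical names, and the checks cover only the basic SPIFFE ID constraints (spiffe scheme, non-empty trust domain, no userinfo, port, query, or fragment).

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// agentID holds the components of a SPIFFE ID of the form
// spiffe://{trust domain}/{org}/{team}/{name}/{instance}.
type agentID struct {
	TrustDomain, Org, Team, Name, Instance string
}

// parseAgentID validates raw as a SPIFFE ID with exactly four path segments.
func parseAgentID(raw string) (agentID, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return agentID{}, err
	}
	if u.Scheme != "spiffe" {
		return agentID{}, errors.New("scheme must be spiffe")
	}
	if u.Host == "" || u.User != nil || u.Port() != "" || u.RawQuery != "" || u.Fragment != "" {
		return agentID{}, errors.New("invalid trust domain or extra URI components")
	}
	parts := strings.Split(strings.TrimPrefix(u.Path, "/"), "/")
	if len(parts) != 4 {
		return agentID{}, errors.New("expected path /org/team/name/instance")
	}
	for _, p := range parts {
		if p == "" {
			return agentID{}, errors.New("empty path segment")
		}
	}
	return agentID{
		TrustDomain: u.Host,
		Org:         parts[0],
		Team:        parts[1],
		Name:        parts[2],
		Instance:    parts[3],
	}, nil
}
```

Parsing the ID is only meaningful after the certificate chain has been verified against the trust domain's CA certificate.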