Intermittent IAM authentication failures in serverless environments #585

Description

@kym6464

Note

I used Claude Opus 4.8 to debug the issue and produce the below writeup.
However, I did read every word and edited it before submitting 😄

Bug Description

Intermittent IAM authentication failures (ER_ACCESS_DENIED_ERROR, errno 1045, "using password: YES") on the first query of an invocation when the connector runs on Cloud Run with request-based billing (CPU throttled to ~0 between requests, no min-instances) and the service is invoked on a schedule (~every 5 minutes).

The vast majority of invocations succeed. Failures cluster for a stretch and then self-resolve without any deploy or config change.

My working theory: with automatic IAM database authentication, the connector uses an OAuth 2.0 access token as the password (which is "short-lived and valid for only one hour") plus the ephemeral client cert, and refreshes both on a background timer ("Cloud SQL connectors are able to request and refresh these tokens"). Under request-based billing, CPU is only allocated during request processing, and the docs note that running Node.js async/background work outside of a request requires instance-based billing, so that refresh timer is effectively suspended during the idle gap between scheduled invocations.

For IAM, the refresh appears to be scheduled close to token expiry (~duration − 4min), so when an idle gap straddles that window, the next request reuses a now-expired token. The connect path (stream(), around connector.ts:226-265 in 1.10.0) appears to reuse the cached cert/token without an expiry check, so the expired token reaches the MySQL handshake and is rejected with 1045.

Because this is a post-handshake auth rejection rather than a TLS error, it doesn't trigger the tlsSocket.once('error', () => forceRefresh()) recovery path, so it doesn't self-heal until a later scheduled refresh happens to run while CPU is allocated.
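To make the timing argument concrete, here is a small model of the theory, not of the connector's actual code. It assumes the refresh timer can only fire while a request is being handled, treats each request as an instant, and takes the one-hour lifetime and ~4 minute lead from the description above. The function name and parameters are hypothetical.

```javascript
// Model only: assumes the refresh timer fires solely while a request has CPU,
// and that each invocation is an instant in time (minutes since token issue).
function findStaleInvocations({ issuedAtMin, lifetimeMin, refreshLeadMin, invocationsMin }) {
  const refreshDue = issuedAtMin + lifetimeMin - refreshLeadMin;
  const expiry = issuedAtMin + lifetimeMin;
  const stale = [];
  for (const t of invocationsMin) {
    if (t < refreshDue) continue;    // timer not due yet, cached token still valid
    if (t >= expiry) stale.push(t);  // no CPU inside the refresh window: token already expired
    break;                           // first invocation at/after refreshDue runs the refresh
  }
  return stale;
}

// Invocation times for a fixed-interval schedule (sorted ascending).
function schedule(offsetMin, intervalMin, count) {
  return Array.from({ length: count }, (_, k) => offsetMin + k * intervalMin);
}

const base = { issuedAtMin: 0, lifetimeMin: 60, refreshLeadMin: 4 };

// Invocations at 0.5, 5.5, ..., 55.5, 60.5: the idle gap spans the whole 56-60 window.
console.log(findStaleInvocations({ ...base, invocationsMin: schedule(0.5, 5, 14) }));

// Invocations at 2, 7, ..., 57, 62: the request at 57 lands inside the window.
console.log(findStaleInvocations({ ...base, invocationsMin: schedule(2, 5, 14) }));
```

Under these assumptions, whether a given token cycle fails depends only on the phase of the schedule relative to token issue time, which would be consistent with failures that come and go without any deploy.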

One possible direction: the Python connector supports a lazy refresh strategy (Connector(refresh_strategy="LAZY")) that validates/refreshes credentials on demand at connect time. An equivalent on-demand freshness check in the Node connector (or a public force-refresh API) would cover environments where background timers can't be relied upon.
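A minimal sketch of what an on-demand freshness check could look like, assuming credentials carry an expiry timestamp. `LazyCredentials`, `fetchCredentials`, `safetyMarginMs`, and the `{ token, expiresAtMs }` shape are all hypothetical names for illustration; none of this is the Node connector's API or the Python connector's implementation.

```javascript
// Hypothetical helper: checks expiry at use time instead of relying on a timer.
class LazyCredentials {
  constructor(fetchCredentials, { safetyMarginMs = 60_000, now = Date.now } = {}) {
    this.fetchCredentials = fetchCredentials; // async () => ({ token, expiresAtMs })
    this.safetyMarginMs = safetyMarginMs;     // refresh this long before expiry
    this.now = now;                           // injectable clock for testing
    this.cached = null;
    this.inFlight = null;
  }

  // True only if a cached credential exists and is not about to expire.
  isFresh() {
    return this.cached !== null &&
      this.cached.expiresAtMs - this.now() > this.safetyMarginMs;
  }

  // Called on the connect path: returns cached credentials or refreshes first.
  async get() {
    if (this.isFresh()) return this.cached;
    if (!this.inFlight) {
      // Concurrent callers share a single refresh.
      this.inFlight = this.fetchCredentials()
        .then((creds) => { this.cached = creds; return creds; })
        .finally(() => { this.inFlight = null; });
    }
    return this.inFlight;
  }
}
```

Because the check runs when a connection is requested, it only needs CPU during the request itself, so it does not depend on anything executing during the idle gap.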

Example code (or command)

const { Connector, AuthTypes } = require('@google-cloud/cloud-sql-connector');
const mysql = require('mysql2/promise');

// Pool is created once and cached across invocations.
let pool;

async function getPool() {
  if (pool) return pool;

  const connector = new Connector();
  const opts = await connector.getOptions({
    instanceConnectionName: 'PROJECT:REGION:INSTANCE',
    authType: AuthTypes.IAM,
  });

  pool = mysql.createPool({
    ...opts,
    user: 'my-service-account', // IAM principal (no @domain)
    database: 'my_db',
  });
  return pool;
}

// On Cloud Run with request-based billing the process is frozen between
// requests, so the connector's background refresh timer does not run during
// idle gaps. On a later scheduled invocation, the first query intermittently
// fails:
async function handleInvocation() {
  const db = await getPool();
  const [rows] = await db.query('SELECT 1');
  return rows;
}

Stacktrace

Error: Access denied for user '<service-account>'@'cloudsqlproxy~<ip>' (using password: YES)
    at Packet.asError (.../node_modules/mysql2/lib/packets/packet.js)
    at ClientHandshake.execute (.../node_modules/mysql2/lib/commands/command.js)
    at Connection.handlePacket (.../node_modules/mysql2/lib/connection.js)
  code: 'ER_ACCESS_DENIED_ERROR',
  errno: 1045,
  sqlState: '28000'

How to reproduce

  1. Deploy a Node.js service to Cloud Run with request-based billing (CPU throttled between requests; no min-instances), using the connector with AuthTypes.IAM and a cached pool.
  2. Invoke it on a schedule (~every 5 minutes) so there's an idle gap between requests during which CPU is throttled.
  3. Over time, the first query of an invocation intermittently fails with errno 1045; failures cluster and then self-resolve.

Environment details

  • OS: Cloud Run (Linux container), request-based billing, no min-instances
  • Node.js version: 24
  • @google-cloud/cloud-sql-connector version: 1.10.0
  • Database: Cloud SQL for MySQL, IAM database authentication, via mysql2

Steps to reproduce

  1. See "How to reproduce" above.

Workaround I'm using

Tearing down and recreating the connector on each invocation (close() + recreate) so the first connection mints fresh credentials. This is fine for my use case of cron jobs: they only run every few minutes, and I'd rather pay a small per-invocation penalty than turn on always-allocated CPU, which would cost more.
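Roughly, the workaround looks like the sketch below. The wrapper name `withFreshConnector` and the `config` shape are made up for illustration; the modules are passed in as arguments only so the snippet stands alone. It relies on `connector.getOptions()`, `connector.close()`, and the mysql2 promise pool's `end()`.

```javascript
// Sketch of the per-invocation workaround: nothing is cached across requests.
async function withFreshConnector({ Connector, AuthTypes, mysql }, config, work) {
  const connector = new Connector();
  let pool;
  try {
    const opts = await connector.getOptions({
      instanceConnectionName: config.instanceConnectionName,
      authType: AuthTypes.IAM,
    });
    pool = mysql.createPool({
      ...opts,
      user: config.user,
      database: config.database,
    });
    return await work(pool);
  } finally {
    // Tear down in reverse order so the next invocation starts from scratch.
    if (pool) await pool.end();
    connector.close();
  }
}

// Usage inside a request handler (requires the real packages):
//
//   const { Connector, AuthTypes } = require('@google-cloud/cloud-sql-connector');
//   const mysql = require('mysql2/promise');
//   const rows = await withFreshConnector(
//     { Connector, AuthTypes, mysql },
//     { instanceConnectionName: 'PROJECT:REGION:INSTANCE', user: 'my-service-account', database: 'my_db' },
//     async (pool) => (await pool.query('SELECT 1'))[0]
//   );
```

The cost is a fresh credential fetch and a new connection on every invocation, which is the small penalty mentioned above.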

Labels: type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)