Intermittent IAM authentication failures in serverless environments

> [!NOTE]
> I used Claude Opus 4.8 to debug the issue and produce the below writeup.  
> However, I did read every word and edited it before submitting 😄

## Bug Description

Intermittent IAM authentication failures (`ER_ACCESS_DENIED_ERROR`, errno **1045**, "using password: YES") on the **first query of an invocation** when the connector runs on **Cloud Run with [request-based billing](https://docs.cloud.google.com/run/docs/configuring/billing-settings)** (CPU throttled to ~0 between requests, no min-instances) and the service is invoked on a schedule (~every 5 minutes).

The vast majority of invocations succeed. Failures **cluster** for a stretch and then **self-resolve** without any deploy or config change.

My working theory: with [automatic IAM database authentication](https://docs.cloud.google.com/sql/docs/mysql/iam-authentication#automatic) the connector uses an [OAuth 2.0 access token as the password](https://docs.cloud.google.com/sql/docs/mysql/iam-authentication#automatic) — which is "short-lived and valid for only one hour" — plus the ephemeral client cert, and [refreshes both on a background timer](https://docs.cloud.google.com/sql/docs/mysql/iam-authentication#automatic) ("Cloud SQL connectors are able to request and refresh these tokens"). Under [request-based billing](https://docs.cloud.google.com/run/docs/configuring/billing-settings#cpu-allocation), CPU is only allocated during request processing, and the docs note that running Node.js async/background work outside of a request [requires instance-based billing](https://docs.cloud.google.com/run/docs/configuring/billing-settings#background-execution-considerations) — so that refresh timer is effectively **suspended during the idle gap** between scheduled invocations. For IAM, the refresh appears to be scheduled close to token expiry (~`duration − 4min`), so when an idle gap straddles that window the next request reuses a now-**expired token**. The connect path (`stream()`, around connector.ts:226-265 in 1.10.0) appears to reuse the cached cert/token **without an expiry check**, so the expired token reaches the MySQL handshake → 1045. Because this is a **post-handshake** auth rejection rather than a TLS error, it doesn't trigger the `tlsSocket.once('error', () => forceRefresh())` recovery path, so it doesn't self-heal until a later scheduled refresh happens to run while CPU is allocated.

One possible direction: the [Python connector](https://docs.cloud.google.com/sql/docs/mysql/iam-logins#expandable-1) supports a lazy refresh strategy (`Connector(refresh_strategy="LAZY")`) that validates/refreshes credentials on demand at connect time. An equivalent on-demand freshness check in the Node connector (or a public force-refresh API) would cover environments where background timers can't be relied upon.

## Example code (or command)

```js
const { Connector, AuthTypes } = require('@google-cloud/cloud-sql-connector');
const mysql = require('mysql2/promise');

const connector = new Connector();
const opts = await connector.getOptions({
  instanceConnectionName: 'PROJECT:REGION:INSTANCE',
  authType: AuthTypes.IAM,
});

const pool = mysql.createPool({
  ...opts,
  user: 'my-service-account', // IAM principal (no @domain)
  database: 'my_db',
});

// Pool is created once and cached across invocations. On Cloud Run with
// request-based billing the process is frozen between requests, so the
// connector's background refresh timer does not run during idle gaps.
// On a later scheduled invocation, the first query intermittently fails:
const [rows] = await pool.query('SELECT 1');
```

## Stacktrace

```
Error: Access denied for user '<service-account>'@'cloudsqlproxy~<ip>' (using password: YES)
    at Packet.asError (.../node_modules/mysql2/lib/packets/packet.js)
    at ClientHandshake.execute (.../node_modules/mysql2/lib/commands/command.js)
    at Connection.handlePacket (.../node_modules/mysql2/lib/connection.js)
  code: 'ER_ACCESS_DENIED_ERROR',
  errno: 1045,
  sqlState: '28000'
```

## How to reproduce

1. Deploy a Node.js service to Cloud Run with **[request-based billing](https://docs.cloud.google.com/run/docs/configuring/billing-settings)** (CPU throttled between requests; no min-instances), using the connector with [`AuthTypes.IAM` and a cached pool](https://docs.cloud.google.com/sql/docs/mysql/iam-logins#expandable-1).
2. Invoke it on a schedule (~every 5 minutes) so there's an idle gap between requests during which CPU is throttled.
3. Over time, the first query of an invocation intermittently fails with errno 1045; failures cluster and then self-resolve.

#### Environment details

- OS: Cloud Run (Linux container), request-based billing, no min-instances
- Node.js version: 24
- `@google-cloud/cloud-sql-connector` version: 1.10.0
- Database: Cloud SQL for MySQL, IAM database authentication, via `mysql2`

#### Steps to reproduce

1. See "How to reproduce" above.

## Workaround I'm using

Tearing down and recreating the connector each invocation (`close()` + recreate) so the first connection mints fresh credentials. This is OK for my use case of cron jobs since they only run every few minutes and I'm okay with a small penalty instead of turning on CPU always allocated which will be more expensive ($ wise).he 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent IAM authentication failures in serverless environments #585

Bug Description

Example code (or command)

Stacktrace

How to reproduce

Environment details

Steps to reproduce

Workaround I'm using

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Intermittent IAM authentication failures in serverless environments #585

Description

Bug Description

Example code (or command)

Stacktrace

How to reproduce

Environment details

Steps to reproduce

Workaround I'm using

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions