Pooling #2345

Open
KoalaGeo wants to merge 5 commits into geopython:master from KoalaGeo:pooling

Conversation

@KoalaGeo
Contributor

Overview

Makes the SQLAlchemy connection pool of the SQL provider configurable per provider via the existing options: block, exposing pool_size, max_overflow, pool_recycle, pool_timeout and pool_pre_ping.

Previously get_engine() called create_engine(conn_str, connect_args=connect_args, pool_pre_ping=True) with no pool sizing or recycle, so the default QueuePool held pool_size connections open for the life of each worker process and never recycled them. In multi-process deployments this produces a large number of permanently-IDLE server-side connections (we saw connections idle for days, eventually exhausting max_connections). There was no way to bound or recycle the pool from configuration.
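To give a sense of scale, the connection count described above can be estimated with simple arithmetic. This is a minimal sketch: the function name and the deployment figures are hypothetical, and it assumes each worker process holds one engine per distinct database and that every pool has filled to pool_size.

```python
def persistent_connections(workers: int, engines_per_worker: int,
                           pool_size: int = 5) -> int:
    """Upper bound on connections held open once every pool has filled."""
    return workers * engines_per_worker * pool_size


# Hypothetical deployment: 8 worker processes, 3 distinct databases,
# SQLAlchemy's default pool_size of 5.
print(persistent_connections(workers=8, engines_per_worker=3))  # 120
```

With no recycling, a figure like this stays held indefinitely, and it already exceeds a PostgreSQL max_connections setting of 100 (typically the stock default).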

Changes:

  • store_db_parameters() now extracts the five pool keys from options, coerces them to their declared types, and stores them as a sorted, hashable tuple (self.db_pool_options). They are popped out of options so they are not forwarded to the DBAPI as connect_args.
  • get_engine() takes a pool_options tuple parameter and applies **dict(pool_options) to create_engine(). It stays @functools.cache-able because the parameter is a hashable tuple, so engine sharing per process is preserved; providers with differing pool config correctly get distinct engines.
  • pygeoapi/process/manager/postgresql.py also calls get_engine(); its call site is updated to pass self.db_pool_options so the manager does not lose pool_pre_ping or skip recycling.
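The extraction step in the first bullet can be sketched in isolation as below. The defaults and the coercion expression mirror the snippet quoted later in the review thread; the standalone function name split_pool_options and the sslmode option are illustrative assumptions, not part of the PR.

```python
POOL_DEFAULTS = {
    'pool_size': 5,
    'max_overflow': 10,
    'pool_recycle': -1,
    'pool_timeout': 30,
    'pool_pre_ping': True,
}


def split_pool_options(options: dict) -> tuple:
    """Pop the pool keys out of ``options`` and return a sorted, hashable tuple."""
    return tuple(sorted(
        # Coerce each value to the type of its default, e.g. int('2') -> 2.
        (key, type(default)(options.pop(key, default)))
        for key, default in POOL_DEFAULTS.items()
    ))


options = {'pool_size': '2', 'pool_recycle': 300, 'sslmode': 'require'}
pool_options = split_pool_options(options)
print(pool_options)
print(options)  # only non-pool keys remain: {'sslmode': 'require'}
```

Because the pool keys are popped, whatever is left in options can still be forwarded as connect_args without leaking pool settings to the DBAPI.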

Backward compatibility: the defaults preserve current behaviour exactly: pool_size=5, max_overflow=10, pool_timeout=30, pool_pre_ping=True, and pool_recycle=-1 (SQLAlchemy's default, i.e. the current effective behaviour).

This PR is therefore a pure, opt-in feature add with no behaviour change for existing users. (See the issue for discussion of whether a finite default pool_recycle should be adopted as a separate follow-up.)

New tests and documentation are included.

Related Issue / discussion

Closes #2344.

Additional information

Example configuration:

providers:
  - type: feature
    name: PostgreSQL
    data:
      host: 127.0.0.1
      port: 5432
      dbname: test
      user: postgres
      password: postgres
      search_path: [osm, public]
    options:
      pool_size: 2          # persistent connections per worker process
      max_overflow: 3       # short-lived burst capacity
      pool_recycle: 300     # recycle connections older than 5 minutes
      pool_timeout: 30
    id_field: osm_id
    table: hotosm_bdi_waterways
    geom_field: foo_geom

Note (documented): because get_engine() is @functools.cache-d on its full argument set, providers that share a database must use identical pool options to continue sharing a single engine per worker; differing options intentionally yield separate engines.
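The sharing rule in the note can be illustrated with a stand-in for get_engine(). This is a sketch under stated assumptions: get_engine_sketch, its two-argument signature and the connection string are hypothetical simplifications, and a placeholder object takes the place of a real SQLAlchemy engine.

```python
import functools


@functools.cache
def get_engine_sketch(conn_str: str, pool_options: tuple) -> object:
    """Stand-in for get_engine(): one placeholder per distinct argument set."""
    # A real implementation would call
    # sqlalchemy.create_engine(conn_str, **dict(pool_options)) here.
    return object()


shared = (('pool_recycle', 300), ('pool_size', 2))
a = get_engine_sketch('postgresql://db/test', shared)
b = get_engine_sketch('postgresql://db/test', shared)
c = get_engine_sketch('postgresql://db/test',
                      (('pool_recycle', 300), ('pool_size', 9)))

print(a is b)  # True: identical pool options share one engine
print(a is c)  # False: differing pool options get a separate engine
```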

Dependency policy (RFC2)

  • I have ensured that this PR meets RFC2 requirements

No new dependencies are introduced; only the standard library and the already-required SQLAlchemy are used.

Updates to public demo

  • I have ensured that breaking changes to the pygeoapi master demo server have been addressed
  • No changes required: defaults preserve existing behaviour, so the demo local.config.yml does not need to change.

Contributions and licensing

  • I'd like to contribute a bugfix/feature (configurable SQL connection pool) to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution
  • I have already previously agreed to the pygeoapi Contributions and Licensing Guidelines

KoalaGeo added 5 commits May 19, 2026 14:56
Added connection pool options for SQL Alchemy engine.
Change pool_recycle to -1 to preserve current behavior.
Added SQLAlchemy connection-pool tuning options to configuration.
test_sql_pool_options.py exercises `store_db_parameters()` directly, requires no database, and runs in standard CI. It asserts the zero-behaviour-change defaults, override + typing, no DBAPI leakage, the existing dict-filtering, hashable/deterministic cache keys, and coexistence with search_path.
@tomkralidis tomkralidis added this to the 0.24.0 milestone May 20, 2026
@webb-ben
Member

Is there a reason not to pop the pool attributes from connect_args inside get_engine? That would consolidate some of the complications noted in the PR around hashing and the manager's use of get_engine. Maybe I am missing something.

Member

@ricardogsilva ricardogsilva left a comment

Just leaving my two cents here - I'm not a core committer so take these with a grain of salt.

Overall I agree with the PR, as adding these connection-related options seems relevant - thanks for your work and I look forward to having it merged!

Personally, I would simplify the implementation a bit, by relying on pygeoapi's JSON Schema document for the validation of the config.

And I would not include most of these tests, which I see as not being relevant.

Comment thread pygeoapi/provider/sql.py
Comment on lines +622 to +624
# Defaults keep SQLAlchemy's QueuePool sizing but, unlike SQLAlchemy's
# default of -1, recycle connections after an hour so that pooled
# connections cannot sit IDLE on the server indefinitely.
Member

This part of the comment seems to be outdated, as you end up setting the default value of pool_recycle to -1

Comment on lines +706 to +719
# SQLAlchemy connection-pool tuning (optional). Defaults match
# SQLAlchemy's QueuePool and preserve previous behaviour.
# Persistent connections held open per worker process.
pool_size: 5
# Extra short-lived connections allowed above pool_size.
max_overflow: 10
# Recreate connections older than this many seconds. -1 (the
# default) never recycles; set a finite value (e.g. 300) so
# pooled connections cannot sit IDLE on the server indefinitely.
pool_recycle: -1
# Seconds to wait for a connection from the pool before erroring.
pool_timeout: 30
# Test connections with a lightweight ping before use.
pool_pre_ping: true
Member

All of these new parameters need to be added to the config schema at

pygeoapi/resources/schemas/config/pygeoapi-config-0.x.yml

This will make it possible to test a pygeoapi configuration for correctness even before starting up the server.

Comment thread pygeoapi/provider/sql.py
(key, type(default)(options.pop(key, default)))
for key, default in pool_defaults.items()
))

Member

In my opinion this could be made easier to read and less complex by:

  • Storing self.db_pool_options as a dict instead of a tuple, and deferring tuple creation to the point where get_engine is called;
  • Relying on the types of passed options already being correct. Adding these new parameters to the config JSON Schema (as I mentioned in my other comment) would mean that the type of each parameter would already be documented and would be enforceable by doing a validation of the config.

Also, note that your implementation contains a subtle bug when trying to parse pool_pre_ping. If the original value had been:

{'pool_pre_ping': 'False'}  # I'm passing a string with the value of "False"

then the outcome would be:

# type(True)("False")
True

In other words, bool("False") is actually True because non-empty strings are truthy.
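The pitfall is easy to reproduce, and one way around it is an explicit parser. The sketch below is only an illustration: parse_bool is a hypothetical helper, not something in pygeoapi or this PR, and the accepted spellings are an assumption.

```python
def parse_bool(value) -> bool:
    """Interpret common string spellings of booleans; pass real bools through."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in {'true', '1', 'yes', 'on'}:
            return True
        if lowered in {'false', '0', 'no', 'off'}:
            return False
    raise ValueError(f'not a boolean: {value!r}')


print(bool('False'))        # True: any non-empty string is truthy
print(parse_bool('False'))  # False
```

If the config schema already restricts pool_pre_ping to a YAML boolean, no such helper is needed, since the value arrives as a real bool.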

-pool_defaults = {
-    'pool_size': 5,
-    'max_overflow': 10,
-    'pool_recycle': -1,   # SQLAlchemy default; preserves current behaviour
-    'pool_timeout': 30,
-    'pool_pre_ping': True,
-}
-self.db_pool_options = tuple(sorted(
-    (key, type(default)(options.pop(key, default)))
-    for key, default in pool_defaults.items()
-))
+self.db_pool_options = {
+    'pool_size': options.pop('pool_size', 5),
+    'max_overflow': options.pop('max_overflow', 10),
+    'pool_recycle': options.pop('pool_recycle', -1),  # SQLAlchemy default: never recycle connections
+    'pool_timeout': options.pop('pool_timeout', 30),
+    'pool_pre_ping': options.pop('pool_pre_ping', True),
+}

Comment thread pygeoapi/provider/sql.py
self.db_user,
self._db_password,
self.db_conn,
self.db_pool_options,
Member

@ricardogsilva ricardogsilva May 21, 2026

As per my other comment, in my opinion it would be clearer if the tuple were generated here, perhaps accompanied by a comment mentioning that this is done to enable the use of functools.cache.

Also, in modern Python, a dict's insertion ordering is preserved, so I don't think sorting the tuple would be needed.

Suggested change
-self.db_pool_options,
+tuple(self.db_pool_options.items()),  # convert to a hashable type for use with functools.cache
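Whether sorting can be dropped depends on how the dict is built. The sketch below uses made-up values to show the difference: a dict written as a fixed literal always yields the same item order, while two dicts holding the same items in different insertion order produce different tuples, and therefore different cache keys.

```python
pool_a = {'pool_size': 2, 'pool_recycle': 300}
pool_b = {'pool_recycle': 300, 'pool_size': 2}  # same items, different order

print(pool_a == pool_b)                                # True
print(tuple(pool_a.items()) == tuple(pool_b.items()))  # False
print(tuple(sorted(pool_a.items())) ==
      tuple(sorted(pool_b.items())))                   # True
```

With the fixed key order of the suggested dict literal, every provider builds its items in the same order, so the unsorted tuple is sufficient there.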

Comment on lines +20 to +30
def test_pool_options_defaults_preserve_current_behaviour():
    obj = _Dummy()
    store_db_parameters(obj, dict(CONN), {})
    pool = dict(obj.db_pool_options)
    # Defaults must match pre-existing effective behaviour:
    # pool_pre_ping was hardcoded True; pool_recycle was unset (-1).
    assert pool['pool_size'] == 5
    assert pool['max_overflow'] == 10
    assert pool['pool_timeout'] == 30
    assert pool['pool_pre_ping'] is True
    assert pool['pool_recycle'] == -1
Member

This test seems unnecessary to me: once this PR is merged, the behavior it implements becomes the current behavior, so the test loses its relevance.

Comment on lines +33 to +45
def test_pool_options_are_overridable_and_typed():
    obj = _Dummy()
    store_db_parameters(
        obj, dict(CONN),
        {'pool_size': 2, 'max_overflow': 3, 'pool_recycle': 300},
    )
    pool = dict(obj.db_pool_options)
    assert pool['pool_size'] == 2 and isinstance(pool['pool_size'], int)
    assert pool['max_overflow'] == 3
    assert pool['pool_recycle'] == 300
    # untouched keys keep defaults
    assert pool['pool_timeout'] == 30
    assert pool['pool_pre_ping'] is True
Member

This test would be unnecessary if you went with my suggestion above: store db_pool_options as a dict instead of a tuple, and rely on the configuration being valid once the missing JSON Schema entries have been added.

Comment on lines +61 to +69
def test_dict_valued_options_still_filtered():
    obj = _Dummy()
    store_db_parameters(
        obj, dict(CONN),
        {'pool_size': 2, 'zoom': {'min': 0, 'max': 22}},
    )
    assert 'zoom' not in obj.db_options
    assert dict(obj.db_pool_options)['pool_size'] == 2

Member

This test seems to be unnecessary, as it is not testing the changes you made in this PR. It verifies that the contents of obj.db_options are correct.

IMO this PR does not make any changes that would warrant this verification, unless you would not trust the behavior of dict.pop, which is a Python builtin.

Comment on lines +71 to +83
def test_pool_options_hashable_and_deterministic():
    a, b = _Dummy(), _Dummy()
    store_db_parameters(a, dict(CONN), {'pool_size': 2})
    store_db_parameters(b, dict(CONN), {'pool_size': 2})
    # identical config -> identical key -> shared engine via functools.cache
    assert a.db_pool_options == b.db_pool_options
    assert hash(a.db_pool_options) == hash(b.db_pool_options)

    c = _Dummy()
    store_db_parameters(c, dict(CONN), {'pool_size': 9})
    # differing pool config -> distinct key (separate engine, by design)
    assert c.db_pool_options != a.db_pool_options

Member

This is testing Python's own implementation of how tuples are hashed, so I don't think it is relevant to include in pygeoapi.

Comment on lines +85 to +93
def test_pool_options_coexist_with_search_path():
    obj = _Dummy()
    store_db_parameters(
        obj, dict(CONN),
        {'search_path': ['published', 'public'], 'pool_size': 4},
    )
    assert obj.db_search_path == ('published', 'public')
    assert dict(obj.db_pool_options)['pool_size'] == 4

Member

This test seems unnecessary, as it does not test the functionality introduced in this PR

Development

Successfully merging this pull request may close these issues.

PostgreSQL/SQL provider connection pool is not configurable and never recycles

4 participants