SAI API Performance Monitoring by JaiOCP · Pull Request #2279 · opencomputeproject/SAI

JaiOCP · 2026-04-21T20:20:13Z

This PR brings in support for measuring SAI API performance. This is based on presentation done in OCP 2023.

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

rck-innovium

While most of the measurements can be done at the application level, this proposal provides a way to measure the metrics per object operation inside bulk APIs which cannot be done by application level performance monitoring.

rck-innovium · 2026-04-23T15:57:36Z

As discussed, the community concluded that we should not preserve this perfmon data across warmboot (especially since we thought it does not make sense for warm upgrades/ downgrades)

deepak-singhal0408 · 2026-05-07T22:55:39Z

@JaiOCP, could you address the comments? thanks,

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

JaiOCP · 2026-05-08T00:33:32Z

Review comments address. Please take a look @j-bos @rck-innovium

JaiOCP · 2026-05-08T00:34:16Z

@JaiOCP, could you address the comments? thanks,

Commens addressed. Please review

deepak-singhal0408 · 2026-05-14T07:39:38Z

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:

Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count

The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

kcudnik · 2026-05-14T08:57:07Z

+    /**
+     * @brief SAI Performance Monitoring API set
+     */


this commet is no needed

kcudnik · 2026-05-14T09:00:13Z

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:
Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count
The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

i would assume that create_bulk route is one of the heaviest api to call, and i would guess that other bulk api maybe faster, and readig performance also should be fast, internally it should be just reading/copying a table

JaiOCP · 2026-05-14T19:08:55Z

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:
Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count
The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

HI Deepak,

As we talked about this, computation done this way is a wrong implementation in SAI adapter.
SAI Adapter MUST do as follows:
Call 1: bulk_create(50 routes) → 50µs

Call 2: bulk_create(3000 routes) → 900µs
Call 3: bulk_create(200 routes) → 100µs
Read AVG_LATENCY → (50+900+100)/(50+3000+200) = 0.32µs
Essentially SAI Adapter need to maintain the object count for computing the average till next clear on read.
.

JaiOCP · 2026-05-15T17:31:43Z

@rck-innovium @j-bos Please approve the PR

j-bos · 2026-05-18T17:32:44Z

+
+```
+/*
+ * Configure CSIG Compact Tag for ABW signal processing and time interval of 256 micro seconds


Fix c/p error on comment

thanks good catch

j-bos · 2026-05-18T17:58:09Z

+```
+
+#### 4.3.3 Perfmon Object Switch Binding
+List of perfmon objects can be bound to the switch object. This binding can be done as a SET operation when perfmon object is created.


Section should be updated for latest changes -- may be good just to do an AI pass to update the .md for latest code changes and fix some typos (like the SAI_OBJECT_TYPE_PERFMO$ truncation error below).

Good Idea. I was using cursor and it doesn't do a text validation.
Fixed the section.

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

deepak-singhal0408 · 2026-05-26T05:43:06Z

Hi, is this PR ready for merge? thanks,
cc: @tjchadaga

tjchadaga · 2026-05-27T19:53:20Z

/azp run

azure-pipelines · 2026-05-27T19:53:30Z

Azure Pipelines successfully started running 1 pipeline(s).

tjchadaga · 2026-05-27T20:03:07Z

@JaiOCP - please squash your commits and force push
@rck-innovium - please help complete the review and sign off on this

Signed-off-by: JaiOCP <jai.kumar@broadcom.com> SAI API performance monitoring Signed-off-by: JaiOCP <jai.kumar@broadcom.com> Fix gensairpc.pl crash on Doxygen 1.9.8+ by reusing NeedsTwoPassProcessing (opencomputeproject#2282) Why: To fix below build error Uncaught exception from user code: at gensairpc.pl line 480. main::assign_attr_types(HASH(0x55e190dc20c8), ARRAY(0x55e190d2d080)) called at gensairpc.pl line 434 main::get_definitions() called at gensairpc.pl line 156 main::assign_attr_types(HASH(0x55e190dc20c8), ARRAY(0x55e190d2d080)) called at gensairpc.pl line 434 main::get_definitions() called at gensairpc.pl line 156 How: gensairpc.pl crashed during SAI thrift build with an uncaught exception at line 480 (assign_attr_types) because its inline Doxygen layout detection was too weak - it only checked sai_8h.xml for any enumvalue presence, missing cases where the new Doxygen 1.9.8+ XML structure requires group__*.xml files to be processed for complete definitions. This caused incomplete parsing, leading to missing types and a croak in assign_attr_types when sai_attribute_value_t could not be found. Changes: - xmlutils.pm: Add NeedsTwoPassProcessing and export it. - parse.pl: Remove local NeedsTwoPassProcessing; use imported version. - gensairpc.pl: Replace inline detection with NeedsTwoPassProcessing() call, fixing the build failure and eliminating code duplication. Signed-off-by: Pavan Naregundi <pnaregundi@marvell.com> Count BFD session state changes from UP to DOWN (opencomputeproject#2268) Signed-off-by: Chikkegowda Chikkaiah <cchikkai@cisco.com> HW FRR switchover notification support for protection groups (opencomputeproject#2269) Signed-off-by: Chikkegowda Chikkaiah <cchikkai@cisco.com> Port storm control enhancemnets (opencomputeproject#2258) (opencomputeproject#2258) Signed-off-by: rpmarvell <rperumal@marvell.com>

JaiOCP · 2026-05-27T21:15:02Z

#2287
New PR Opened

SAI API Performance Monitoring

3af7fdd

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

rck-innovium suggested changes Apr 23, 2026

View reviewed changes

Comment thread doc/perfmon/SAI-perfmon-Spec.md

Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated

j-bos reviewed Apr 23, 2026

View reviewed changes

Comment thread inc/saiswitch.h

j-bos reviewed Apr 23, 2026

View reviewed changes

Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated

tjchadaga added the reviewed PR is discussed in SAI Meeting label Apr 28, 2026

JaiOCP added 2 commits May 7, 2026 16:27

Merge branch 'opencomputeproject:master' into perfmon

e6fde37

SAI API Performance Monitoring

dcf7229

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

kcudnik reviewed May 14, 2026

View reviewed changes

deepak-singhal0408 approved these changes May 15, 2026

View reviewed changes

j-bos reviewed May 18, 2026

View reviewed changes

JaiOCP added 2 commits May 21, 2026 15:21

SAI API performance monitoring

96653ea

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

Merge branch 'master' into perfmon

3ce317d

j-bos approved these changes May 22, 2026

View reviewed changes

tjchadaga requested a review from rck-innovium May 27, 2026 19:52

JaiOCP added 2 commits May 27, 2026 14:00

Merge branch 'perfmon' of https://github.com/JaiOCP/SAI into perfmon

eacccf7

JaiOCP mentioned this pull request May 27, 2026

SAI API Performance Monitoring #2287

Open

JaiOCP closed this May 27, 2026

Conversation

JaiOCP commented Apr 21, 2026

Uh oh!

rck-innovium left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rck-innovium commented Apr 23, 2026

Uh oh!

Uh oh!

Uh oh!

deepak-singhal0408 commented May 7, 2026

Uh oh!

JaiOCP commented May 8, 2026

Uh oh!

JaiOCP commented May 8, 2026

Uh oh!

deepak-singhal0408 commented May 14, 2026

Uh oh!

kcudnik May 14, 2026

Choose a reason for hiding this comment

Uh oh!

kcudnik commented May 14, 2026

Uh oh!

JaiOCP commented May 14, 2026

Uh oh!

JaiOCP commented May 15, 2026

Uh oh!

j-bos May 18, 2026

Choose a reason for hiding this comment

Uh oh!

JaiOCP May 21, 2026

Choose a reason for hiding this comment

Uh oh!

j-bos May 18, 2026

Choose a reason for hiding this comment

Uh oh!

JaiOCP May 21, 2026

Choose a reason for hiding this comment

Uh oh!

deepak-singhal0408 commented May 26, 2026

Uh oh!

tjchadaga commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

tjchadaga commented May 27, 2026

Uh oh!

JaiOCP commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants