Skip to content

SAI API Performance Monitoring#2279

Closed
JaiOCP wants to merge 7 commits into
opencomputeproject:masterfrom
JaiOCP:perfmon
Closed

SAI API Performance Monitoring#2279
JaiOCP wants to merge 7 commits into
opencomputeproject:masterfrom
JaiOCP:perfmon

Conversation

@JaiOCP
Copy link
Copy Markdown
Contributor

@JaiOCP JaiOCP commented Apr 21, 2026

This PR brings in support for measuring SAI API performance. This is based on presentation done in OCP 2023.

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>
Copy link
Copy Markdown
Contributor

@rck-innovium rck-innovium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While most of the measurements can be done at the application level, this proposal provides a way to measure the metrics per object operation inside bulk APIs which cannot be done by application level performance monitoring.

Comment thread doc/perfmon/SAI-perfmon-Spec.md
Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated
@rck-innovium
Copy link
Copy Markdown
Contributor

As discussed, the community concluded that we should not preserve this perfmon data across warmboot (especially since we thought it does not make sense for warm upgrades/ downgrades)

Comment thread inc/saiswitch.h
Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated
@tjchadaga tjchadaga added the reviewed PR is discussed in SAI Meeting label Apr 28, 2026
@deepak-singhal0408
Copy link
Copy Markdown

@JaiOCP, could you address the comments? thanks,

@JaiOCP
Copy link
Copy Markdown
Contributor Author

JaiOCP commented May 8, 2026

Review comments address. Please take a look @j-bos @rck-innovium

@JaiOCP
Copy link
Copy Markdown
Contributor Author

JaiOCP commented May 8, 2026

@JaiOCP, could you address the comments? thanks,

Commens addressed. Please review

@deepak-singhal0408
Copy link
Copy Markdown

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:

Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count

The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

Comment thread inc/saiperfmon.h
Comment on lines +178 to +180
/**
* @brief SAI Performance Monitoring API set
*/
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this commet is no needed

@kcudnik
Copy link
Copy Markdown
Collaborator

kcudnik commented May 14, 2026

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:

Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count

The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

i would assume that create_bulk route is one of the heaviest api to call, and i would guess that other bulk api maybe faster, and readig performance also should be fast, internally it should be just reading/copying a table

@JaiOCP
Copy link
Copy Markdown
Contributor Author

JaiOCP commented May 14, 2026

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:

Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count

The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

HI Deepak,

As we talked about this, computation done this way is a wrong implementation in SAI adapter.
SAI Adapter MUST do as follows:
Call 1: bulk_create(50 routes) → 50µs

Call 2: bulk_create(3000 routes) → 900µs
Call 3: bulk_create(200 routes) → 100µs
Read AVG_LATENCY → (50+900+100)/(50+3000+200) = 0.32µs
Essentially SAI Adapter need to maintain the object count for computing the average till next clear on read.
.

@JaiOCP
Copy link
Copy Markdown
Contributor Author

JaiOCP commented May 15, 2026

@rck-innovium @j-bos Please approve the PR

Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated

```
/*
* Configure CSIG Compact Tag for ABW signal processing and time interval of 256 micro seconds
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix c/p error on comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks good catch

Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated
```

#### 4.3.3 Perfmon Object Switch Binding
List of perfmon objects can be bound to the switch object. This binding can be done as a SET operation when perfmon object is created.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Section should be updated for latest changes -- may be good just to do an AI pass to update the .md for latest code changes and fix some typos (like the SAI_OBJECT_TYPE_PERFMO$ truncation error below).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Idea. I was using cursor and it doesn't do a text validation.
Fixed the section.

JaiOCP added 2 commits May 21, 2026 15:21
Signed-off-by: JaiOCP <jai.kumar@broadcom.com>
@deepak-singhal0408
Copy link
Copy Markdown

Hi, is this PR ready for merge? thanks,
cc: @tjchadaga

@tjchadaga tjchadaga requested a review from rck-innovium May 27, 2026 19:52
@tjchadaga
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@tjchadaga
Copy link
Copy Markdown
Collaborator

@JaiOCP - please squash your commits and force push
@rck-innovium - please help complete the review and sign off on this

JaiOCP added 2 commits May 27, 2026 14:00
Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

SAI API performance monitoring

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>

Fix gensairpc.pl crash on Doxygen 1.9.8+ by reusing NeedsTwoPassProcessing (opencomputeproject#2282)

Why:
To fix below build error

Uncaught exception from user code:
	 at gensairpc.pl line 480.
		main::assign_attr_types(HASH(0x55e190dc20c8), ARRAY(0x55e190d2d080)) called at gensairpc.pl line 434
		main::get_definitions() called at gensairpc.pl line 156
	main::assign_attr_types(HASH(0x55e190dc20c8), ARRAY(0x55e190d2d080)) called at gensairpc.pl line 434
	main::get_definitions() called at gensairpc.pl line 156

How:
gensairpc.pl crashed during SAI thrift build with an uncaught exception
at line 480 (assign_attr_types) because its inline Doxygen layout
detection was too weak - it only checked sai_8h.xml for any enumvalue
presence, missing cases where the new Doxygen 1.9.8+ XML structure
requires group__*.xml files to be processed for complete definitions.
This caused incomplete parsing, leading to missing types and a croak
in assign_attr_types when sai_attribute_value_t could not be found.

Changes:
- xmlutils.pm: Add NeedsTwoPassProcessing and export it.
- parse.pl: Remove local NeedsTwoPassProcessing; use imported version.
- gensairpc.pl: Replace inline detection with NeedsTwoPassProcessing()
  call, fixing the build failure and eliminating code duplication.

Signed-off-by: Pavan Naregundi <pnaregundi@marvell.com>

Count BFD session state changes from UP to DOWN (opencomputeproject#2268)

Signed-off-by: Chikkegowda Chikkaiah <cchikkai@cisco.com>

HW FRR switchover notification support for protection groups (opencomputeproject#2269)

Signed-off-by: Chikkegowda Chikkaiah <cchikkai@cisco.com>

Port storm control enhancemnets (opencomputeproject#2258) (opencomputeproject#2258)

Signed-off-by: rpmarvell <rperumal@marvell.com>
@JaiOCP
Copy link
Copy Markdown
Contributor Author

JaiOCP commented May 27, 2026

#2287
New PR Opened

@JaiOCP JaiOCP closed this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

reviewed PR is discussed in SAI Meeting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants