34 changes: 33 additions & 1 deletion tags/tag-workloads-foundation/subprojects/batch/charter.md
Original file line number Diff line number Diff line change
@@ -1 +1,33 @@
Charter content here
# TAG Workloads Foundation Batch Subproject Charter

## Mission

The cloud-native batch scheduling ecosystem is fragmented: different projects tackle job scheduling, queueing, and resource management in incompatible ways. The Batch subproject brings together maintainers and users across the ecosystem to reduce that fragmentation by aligning on common Kubernetes APIs and primitives, developing best practices, and improving outcomes for batch workloads (whether HPC, AI/ML, data analytics, or CI) in cloud-native environments.

## Scope

### In Scope

To reduce fragmentation in the Kubernetes batch ecosystem, the subproject convenes leads and users from different external and internal projects and user groups in the batch ecosystem (CNCF TAGs, Kubernetes sub-projects focused on batch-related features such as topology-aware scheduling) to gather requirements, validate designs, and encourage reuse of core Kubernetes APIs.

Recommendations for enhancements in the following areas:

* Additions to the batch API group, which currently includes the Job and CronJob resources, that benefit batch use cases such as HPC, AI/ML, data analytics, and CI.
* Primitives for job-level queueing, not limited to the Kubernetes Job resource. Long-term, this could include multi-cluster support.
* Primitives to control and maximize utilization of resources in fixed-size clusters (on-prem) and elastic clusters (cloud).
* Benchmarking models for Batch systems
* Data Locality
* User Stories
* Scheduling support for specialized hardware (Accelerators, NUMA, Networking, etc.)

### Out of Scope

* Addition of new API kinds that serve a specialized type of workload. The focus should be on general APIs that specialized controllers can build on top of.
* Uses of the batch APIs as support for serving workloads (e.g., backups, upgrades, migrations). These can be served by existing SIGs.
* Proposals that duplicate the functionality of core Kubernetes components (job-controller, kube-scheduler, cluster-autoscaler).
* Job workflows or pipelines. Mature third-party frameworks serve these use cases with the current Kubernetes primitives. However, additional primitives to support these frameworks could be in scope.

## Deliverables

* **Project Landscape**: a living catalogue of batch scheduling projects in the cloud-native ecosystem, maintained at [bsi-landscape.netlify.app](https://bsi-landscape.netlify.app/).
* **Whitepapers and Technical Research**: the subproject produces papers and research on topics relevant to cloud-native batch scheduling, such as benchmarking of batch systems, data locality, scheduling best practices, and user stories. An initial series of five whitepapers is complete, with more planned as the space evolves.
@@ -0,0 +1,7 @@
# Batch Subproject: Benchmarking Initiative

This directory contains work from the benchmarking initiative of the [CNCF Batch Subproject](https://tag-workloads-foundation.cncf.io/batch/).

## Overview

The benchmarking initiative develops models, methodologies, and tools for evaluating and comparing batch scheduling systems in cloud-native environments.
@@ -0,0 +1,58 @@
# 📅 Jun 6, 2022

## 👥 Attendees

- Alex Scammon (G-Research) [host]
- Kevin Hannon (G-Research Open Source)
- Dave Gantenbein (G-Research)
- Jonathan Skone (LBNL NERSC)
- Diana Arroyo (IBM)
- Nathan Rini (SchedMD)

## 📋 Agenda

- Introductions and agenda setting

## Discussion Notes

- Introductions and Interests: participants introduced themselves and shared their backgrounds and interests in batch processing within a cloud-native framework.
  - Alex Scammon (G-Research): Leads open-source initiatives; keen on multi-cluster batch scheduling and bridging gaps between infrastructure teams and researchers.
  - Jonathan Skone (Lawrence Berkeley National Lab): Focuses on emerging technologies to prepare for next-generation system procurement, balancing traditional HPC workflows with cloud-native approaches.
  - Diana Arroyo (IBM Research): Works on Multi-Cluster App Dispatcher (MCAD) and batch scheduling challenges in multi-cluster and hybrid environments.
  - Nathan Rini (SchedMD): Maintains the SLURM batch scheduler; aims to enhance SLURM's container support and explore integration with Kubernetes.
  - Kevin Hannon and Dave Gantenbein (G-Research): Contributors to Armada and related open-source projects focused on multi-cluster scheduling.
- Directive for the Working Group:
  - Discussion centered on defining the unique role of this WG compared to other groups, such as the Kubernetes Batch WG and the CNCF Research End User Group.
  - Agreement to focus on:
    - Identifying and solving key issues in batch processing for Kubernetes and derivatives.
    - Bridging the divide between infrastructure teams and end users (researchers).
- White Paper Development:
  - Plan to write a white paper addressing:
    - Key problems with batch processing in Kubernetes and derivatives.
    - Assessment of tools and solutions available in the ecosystem.
    - The cultural and operational divide between end users and infrastructure teams.
  - The white paper will also serve as a baseline for identifying future work and proof-of-concept projects for the WG.
- Survey Development:
  - Proposal to create and distribute a survey to:
    - Gather insights from end users and infrastructure teams on their batch processing needs and challenges.
    - Identify common barriers and misalignments between these groups.
  - Alex Scammon and Diana Arroyo volunteered to collaborate on designing the survey.
- Participation Expansion:
  - Emphasis on inviting broader participation in the WG by reaching out to relevant stakeholders. Suggestions included:
    - Klaus Ma (Volcano)
    - Participants from San Diego Supercomputing Center and Google.
    - Representatives from vendors like IBM/Red Hat and Cray/HPE.
- Meeting Time Adjustment:
  - Proposal to move meeting time from 7:30 AM PDT to 8:00 AM PDT to better accommodate participants.
  - Alex will confirm with Volcano participants (Asia-based) regarding their intentions to join and the feasibility of the time adjustment.
- Other Key Discussion Points:
  - Cultural Divide:
    - Agreement that researchers often resist change and prefer established tools (e.g., SLURM), whereas infrastructure teams focus on long-term maintainability and scalability (e.g., Kubernetes).
    - Challenge of creating solutions that balance these competing priorities while improving usability and flexibility.
  - Infrastructure Observations:
    - Jonathan described NERSC's move to Kubernetes for managing control planes and enterprise services, but noted that only ~25% of workloads are containerized.
    - Nathan highlighted that SLURM remains widely used, with some researchers embedding it into Kubernetes clusters to maintain traditional workflows.
  - Potential Collaboration Areas:
    - Standardizing workflows across schedulers and reducing fragmentation in Kubernetes batch solutions (e.g., Armada, Volcano, MCAD).
    - Exploring proof-of-concept projects after the white paper to develop unified approaches or tooling for batch processing in cloud-native environments.
@@ -0,0 +1,14 @@
# 📅 Jun 20, 2022

## 👥 Attendees

- Alex Scammon (G-Research)
- Diana Arroyo (IBM)
- Nathan Rini (SchedMD)
- C. Rindi (G-Research)
- Klaus Ma (Nvidia)
- Jonathan Skone (LBNL NERSC)

## Discussion Notes

Batch Survey
@@ -0,0 +1,58 @@
# 📅 Jul 18, 2022

## 👥 Attendees

- Alex Scammon (G-Research)
- Kevin Hannon (G-Research)
- Diana Arroyo (IBM)
- Jonathan Skone (NERSC LBNL)
- Weiwei Yang (Apple)

## 📋 Agenda

- Updates:
  - Cleanup of our docs
  - Applied to have a Cloud Native Community Group presence
- Go over the latest Batch Survey ideas
- Discuss the "working" of this working group
  - Kevin: understands k8s batch working group; unsure of research user group
  - Jamie:
    - Research user group is CNCF tech for research institutions
    - Talks about batch a lot, but not exclusively. Notebooks; security implications
    - Mostly focused around k8s but it's focused on the type of person invited. Generally from Academia or other "research" institutions
  - Jonathan:
    - Thought k8s batch was going to be focused on working on fundamental improvements to k8s for batch
    - K8s batch has started to include more education/information sharing
    - Someone spoke on Slurm w/ k8s glasses on at the last k8s batch
  - Diana:
    - K8s batch started by focusing on Kueue
    - As discussions evolved, wanted to identify patterns
    - K8s batch currently surveying the landscape to move forward on the lower level improvements
  - Kevin:
    - Agreed with Diana: understanding the use-cases and common problems so that the underlying architecture improvements will be correct
  - Diana:
  - Alex:
    - Back to the original question: what "work" are we up for in this working group?
  - Kevin:
    - Feels that k8s isn't in the toolbox for researchers and HPC community
    - Maybe focus on "old-world" schedulers
    - Reach out to the non-k8s community
    - From work at NIH, maybe reaching out to bioinformatics community
      - Like Galaxy community
      - CubeGene on Volcano, perhaps?
  - Diana:
    - Good idea to take a specific feature of batch schedulers and see how that feature differs
  - Jonathan:
    - Likes the idea for this group to phase itself out over time
    - Not worry too much about creating tasks up front; let the conversation evolve
    - Be a beacon of recommendations for batch in CNCF
  - Wei:
    - Finding the discussion helpful; looking for clarity around all these working groups
- More Outreach ideas
  - Alex:
    - Kevin, give me names from bioinformatics, etc.
  - Jonathan:
    - Has access to researchers; has a broad task to understand their needs
    - Target the High-Energy physicists community? Perhaps Atlas group?
    - Target infra side of the house in research
  - Alex: Maybe ask Ricardo to send some folks?
@@ -0,0 +1,22 @@
# 📅 Aug 1, 2022

## 👥 Attendees

- Alex Scammon (G-Research)
- Jorge Vargas (Thermo Fisher Scientific)
- Diana Arroyo (IBM)
- Kevin Hannon (G-Research)
- Abhishek Malvankar (IBM)
- Michel Sumbul (G-Research)
- Dave Gantenbein (G-Research)
- Jonathan Skone (NERSC)

## 📋 Agenda

- News and updates
- Armada Sandboxed
- Survey Update
- Targeting Infra Community
- Academic services
- Sidecar services:
- Infra groups that set up long-lived enterprisey services
@@ -0,0 +1,13 @@
# 📅 Aug 15, 2022

## 👥 Attendees

- Abhishek Malvankar (IBM)
- Alex Scammon (GR Open Source)
- Dave Gantenbein (GR Open Source)
- Diana Arroyo (IBM)

## Discussion Notes

Survey update, CNCF access
more
@@ -0,0 +1,17 @@
# 📅 Sep 26, 2022

## 👥 Attendees

- Alex Scammon (G-Research)
- Kevin Hannon (G-Research)
- Abhishek Malvankar (IBM)
- Dave Gantenbein (G-Research)
- Jonathan Skone (NERSC)

## 📋 Agenda

- Ideas from Kubecon for potential presenters?
- Kevin: How do HPC schedulers support cloud?
- https://community.cncf.io/events/details/cncf-research-end-user-group-presents-cncf-research-end-user-group-oci-containers-with-scrun-nate-rini-schedmd-2022-10-05/
- Nate:
- https://www.youtube.com/watch?v=7y7IpCTj5mk&t=1s
@@ -0,0 +1,16 @@
# 📅 Oct 10, 2022

## 👥 Attendees

- Kevin Hannon (G-Research)
- Nathan Rini (SchedMD)
- Alex Scammon (G-Research)
- Dave Gantenbein (G-Research)
- Abhishek Malvankar (IBM)
- Jonathan Skone (NERSC)

## 📋 Agenda

- Survey Updates
- Need more options and help
- Nathan Rini: Slurm Container Presentation
@@ -0,0 +1,17 @@
# 📅 Nov 7, 2022

## 👥 Attendees

- (G-Research Open Source)
- (G-Research Open Source)
- Jonathan Skone (NERSC)
- Alex Scammon (G-Research Open Source)
- Nate Rini (SchedMD)

## Discussion Notes

Kevin presents "Interactive Jobs in Kubernetes"
https://docs.google.com/document/d/1-kiduaazR9-04_pcoUJE4zECNVYnar0RtfJezUBBwzM/edit#
Batch WG Mandate
Survey Updates
Other Kubecon Takeaways
@@ -0,0 +1,5 @@
# 📅 Nov 21, 2022

## Discussion Notes

(Notes forthcoming)
@@ -0,0 +1,13 @@
# 📅 Dec 19, 2022

## 👥 Attendees

- Kevin Hannon (G-Research Open Source)
- Alex Scammon (G-Research Open Source)
- Jonathan Skone (NERSC)
- Nate Rini (SchedMD)

## Discussion Notes

(Cats)
Hardware, roadmaps, Kubernetes
@@ -0,0 +1,8 @@
# 📅 Jan 30, 2023

## 👥 Attendees

- Nate Rini (SchedMD)
- Kevin Hannon (G-Research Open Source)
- Alex Scammon (G-Research Open Source)
- Diana Arroyo (IBM)
@@ -0,0 +1,5 @@
# 📅 Feb 13, 2023

## Discussion Notes

(notes forthcoming)
@@ -0,0 +1,5 @@
# 📅 Feb 27, 2023

## Discussion Notes

(notes forthcoming)
@@ -0,0 +1,9 @@
# 📅 Mar 13, 2023

## 👥 Attendees

- Alex Scammon (GR Open Source)
- Tim Middelkoop (Internet2)
- Nate Rini (SchedMD)
- Jonathan Skone (NERSC)
- Dave Gantenbein (GR Open Source)
@@ -0,0 +1,11 @@
# 📅 Mar 27, 2023

## 👥 Attendees

- Alex Scammon
- Caterina Rindi
- Diana Arroyo
- Jonathan Skone
- Kevin Hannon
- Nate Rini
- Matthew West
@@ -0,0 +1,11 @@
# 📅 Apr 10, 2023

## 👥 Attendees

- Kevin Hannon (G-Research Open Source)
- Caterina Rindi (G-Research Open Source)
- Alex Scammon (G-Research Open Source)
- Nate Rini (SchedMD)
- Jonathan Skone (NERSC)
- Diana Arroyo (IBM)
- Dave Gantenbein (G-Research Open Source)
@@ -0,0 +1,8 @@
# 📅 Apr 24, 2023

## 👥 Attendees

- Jonathan Skone (NERSC)
- Nate Rini (SchedMD)
- Diana Arroyo (IBM)
- Caterina Rindi (G-Research Open Source)
@@ -0,0 +1,14 @@
# 📅 May 8, 2023

## 👥 Attendees

- Kevin Hannon (G-Research Open Source)
- Jonathan Skone (NERSC)
- Diana Arroyo (IBM)
- Alex Scammon (G-Research Open Source)
- Dave Gantenbein (G-Research Open Source)

## 📋 Agenda

- KubeCon-EU debrief
- NERSC RFP (https://www.nersc.gov/systems/nersc-10/draft-tech-req/)
@@ -0,0 +1,17 @@
# 📅 May 22, 2023

## 👥 Attendees

- Alex Scammon (G-Research Open Source)
- Caterina Rindi (G-Research Open Source)
- Nate Rini (SchedMD)
- Jonathan Skone (NERSC)
- Matthew West

## 📋 Agenda

- Batch Landscape Discussions
- Three Possible Landscape Venues:
- https://github.com/lfai/lfai-landscape
- https://github.com/cncf/landscape
- https://github.com/cncf/tag-runtime/tree/main
@@ -0,0 +1,14 @@
# 📅 Sep 11, 2023

## 👥 Attendees

- Alex Scammon (G-Research)

## Discussion Notes

Housecleaning
Fixed calendar invites on:
https://docs.google.com/document/d/1mIhSTOa5bQWHY9oAIopzpy1w-hACm_ewe7fzMZ30Z6c/edit
https://docs.google.com/document/d/1GuZGyBkRGG0lEeiPA8q0PfvFlwUlwa5k-ZfXafCTdBY/edit#heading=h.63y814c3aujl
https://github.com/cncf/tag-runtime/blob/main/wg/bsi.md
https://github.com/cncf/tag-runtime/pull/76