AI DCN

IETF

Meetings

1. AI DC Side Meeting at IETF 118

Time: 17:00 - 19:00 Tuesday, November 7, 2023 (Central European Time - Prague)

Chairs: Jeff Tantsura (NVIDIA), Yingzhen Qu (Futurewei)

Recording: webex video

Materials: https://github.com/Yingzhen-ietf/AIDC-IETF118/

Agenda:

17:10 Networking in AI – Omer Shabtai (Nvidia)

17:35 Astral-Network: efficient large-scale datacenter network for large language model training – Baojia Li (Tencent)

17:55 Self-Adjusting Networks – Stefan Schmid (TU Berlin)

18:15 CSIG - Simple and Effective In-band Network Signals for Efficient Traffic Management in Datacenter Networks – Abhiram Ravi (Google)

2. AI DC Side Meeting at IETF 117

Time: 15:30 - 17:00 Monday, July 24, 2023

Chairs:

Jeff Tantsura (jefftant.ietf@gmail.com)
Yingzhen Qu (yingzhen.ietf@gmail.com)

Materials: https://github.com/Yingzhen-ietf/AIDC-IETF117

Agenda:

15:35 Networking in AI/ML Cluster – Barak Gafni (Nvidia) (20 mins)

15:55 Routing in Dragonfly Topology – Dmitry Afanasiev (20 mins)

16:15 Astral-Network: efficient large-scale datacenter network for AI/ML – Baojia Li (Tencent) (10 mins)

16:25 DC Routing – Tony Przygienda (Juniper) (10 mins)

16:35 New Requirements and Thoughts for AI Data Center Networks - From a service provider’s perspective – Weiqiang Cheng (China Mobile) (10 mins)

Drafts

1. Application-aware Data Center Network (APDN) Use Cases and Requirements

Publication URL: https://datatracker.ietf.org/doc/draft-wh-rtgwg-application-aware-dc-network/

Introduction:

APDN (Application-aware Data Center Network) adopts the APN framework on the application side to provide more application-aware information to the data center network, enabling the fast evolution of network-application co-design technology. This document elaborates on APDN use cases and proposes the corresponding requirements.

2. Notification for Adaptive Routing

Publication URL: https://datatracker.ietf.org/doc/draft-wh-rtgwg-adaptive-routing-arn/

Introduction:

Large-scale supercomputing and AI data centers use multipath forwarding to balance load and improve link reliability. Adaptive routing (AR), which is widely used in direct topologies such as Dragonfly, can dynamically adjust routing policies based on path congestion and failures. When congestion or a failure occurs, the local node applies AR, but the congestion or failure information also needs to be sent to other nodes in a timely and accurate manner so that they can apply AR as well and avoid exacerbating congestion on the path. This document specifies Adaptive Routing Notification (ARN) for proactively disseminating congestion detection and congestion elimination information.
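The idea of reacting to a congestion notification can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ARN mechanism or message format defined in the draft: the `Node` class, its methods, and the hash-based next-hop choice are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy switch that chooses a next hop using the congestion state it knows about."""
    name: str
    # Hypothetical state: next hop -> True if a path through it is reported congested.
    congested: dict = field(default_factory=dict)

    def add_next_hop(self, hop: str) -> None:
        self.congested[hop] = False

    def receive_notification(self, hop: str, is_congested: bool) -> None:
        # A notification reports congestion detected (True) or eliminated (False).
        if hop in self.congested:
            self.congested[hop] = is_congested

    def choose_next_hop(self, flow_hash: int) -> str:
        hops = sorted(self.congested)
        healthy = [h for h in hops if not self.congested[h]]
        # Fall back to all next hops if every path is reported congested.
        candidates = healthy or hops
        return candidates[flow_hash % len(candidates)]


node = Node("leaf1")
for hop in ("spine1", "spine2"):
    node.add_next_hop(hop)

before = node.choose_next_hop(flow_hash=0)
node.receive_notification("spine1", True)    # remote congestion detected
during = node.choose_next_hop(flow_hash=0)
node.receive_notification("spine1", False)   # congestion eliminated
after = node.choose_next_hop(flow_hash=0)
print(before, during, after)
```

In this sketch the flow moves away from the congested next hop while the notification is in effect and returns once the elimination notice arrives.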

3. Collective Communication Optimization: Problem Statement and Use cases

Publication URL: https://datatracker.ietf.org/doc/draft-yao-tsvwg-cco-problem-statement-and-usecases/

Introduction:

Collective communication is the basic logical communication model for distributed applications. As distributed systems scale, communication overhead becomes the bottleneck of the entire system and prevents performance from increasing. This draft describes the performance challenges that arise when collective communication is used in a network with more participating nodes or processes, or when more communication rounds are required to complete a single job. The document also presents several use cases in which different aspects of collective communication optimization are needed.
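To make the scaling concern concrete, the following sketch uses the common latency-bandwidth (alpha-beta) cost model for a ring all-reduce. The model and the parameter values are assumptions chosen for illustration and are not taken from the draft; `ring_allreduce_time` is a hypothetical helper name.

```python
def ring_allreduce_time(num_nodes: int, message_bytes: float,
                        latency_s: float, seconds_per_byte: float) -> float:
    """Estimate ring all-reduce time with the alpha-beta cost model.

    A ring all-reduce runs a reduce-scatter followed by an all-gather,
    each taking (num_nodes - 1) steps that move message_bytes / num_nodes.
    """
    if num_nodes < 2:
        return 0.0  # a single node has nothing to exchange
    steps = 2 * (num_nodes - 1)
    chunk_bytes = message_bytes / num_nodes
    return steps * (latency_s + chunk_bytes * seconds_per_byte)


# Assumed example values: 1 MB message, 5 microseconds per-step latency,
# 1e-10 seconds per byte (roughly 10 GB/s per link).
for n in (8, 64, 1024):
    t = ring_allreduce_time(n, 1e6, 5e-6, 1e-10)
    print(f"{n:5d} nodes: {t * 1e3:.3f} ms")
```

Under this model the bandwidth term stays nearly constant as nodes are added, while the per-step latency term grows linearly with the node count, which is one way the overhead comes to dominate at scale.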

4. Collective Communication Optimization: Requirement and Analysis

Publication URL: https://datatracker.ietf.org/doc/draft-yao-tsvwg-cco-requirement-and-analysis/

Introduction:

As mentioned in draft [CCO PS & USECASE], the most obvious reason why existing protocols cannot meet the high-performance requirements of collective communication is that these distributed applications are not co-designed with the underlying networking protocols. There is a semantic gap between inter-process message transportation and packet forwarding, which should be bridged by efficient mapping and optimization.

This draft further presents the technical requirements for the design of collective communication optimization, and discusses and analyzes several pieces of related work.

5. Coordinated Congestion Management

Publication URL: https://datatracker.ietf.org/doc/draft-lyu-rtgwg-coordinated-cm/

Introduction:

AI fabrics are sensitive to bandwidth. Congestion management, including congestion control and load balancing, is a main method for fully utilizing network resources. However, current congestion management mechanisms are not coordinated, which reduces throughput. This document provides a scheme for coordinating different congestion management mechanisms. It describes the design principles and the behaviors of network switches and hosts in the scheme, and gives an example of the end-to-end procedure.

6. PFC-Free Low Delay Control Protocol

Publication URL: https://datatracker.ietf.org/doc/draft-dai-tsvwg-pfc-free-congestion-control/

Introduction:

This document presents LDCP, a new transport that scales loss-sensitive transports, e.g., RDMA, to entire data centers containing tens of thousands of machines, without depending on PFC for losslessness, i.e., PFC-free. LDCP develops a novel end-to-end congestion control scheme and achieves very low queue occupancy even under high network utilization or large traffic churn, resulting in almost no packet loss. Meanwhile, LDCP allows a new flow to jump-start at full speed from the very beginning and therefore minimizes the latency of short RPC-style transactions. LDCP relies only on WRED and ECN, two widely supported switch features, so it can be easily deployed in existing network infrastructure. Finally, LDCP is simple by design and thus can be easily implemented on programmable or ASIC NICs.
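For readers unfamiliar with the switch features mentioned above, the sketch below shows the classic RED-style linear marking ramp that WRED with ECN typically applies. It illustrates the switch-side behavior only and is not LDCP's congestion control algorithm; the function name and the threshold values are assumptions for illustration.

```python
def ecn_mark_probability(queue_len: float, min_th: float,
                         max_th: float, max_p: float) -> float:
    """Return the probability of ECN-marking a packet for a given queue length.

    queue_len is whichever queue measure the switch uses (average or
    instantaneous). Below min_th nothing is marked, at or above max_th
    everything is marked, and in between the probability rises linearly to max_p.
    """
    if max_th <= min_th:
        raise ValueError("max_th must be greater than min_th")
    if queue_len < min_th:
        return 0.0
    if queue_len >= max_th:
        return 1.0
    return max_p * (queue_len - min_th) / (max_th - min_th)


# Assumed thresholds in packets: start marking at 20, mark everything at 80.
for q in (10, 50, 80):
    print(q, ecn_mark_probability(q, min_th=20, max_th=80, max_p=0.1))
```

An end host that sees the resulting ECN marks can reduce its sending rate before the queue overflows, which is why a transport built on these two features can aim to avoid loss without PFC.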

7. Signaling In-Network Computing operations (SINC)

Publication URL: https://datatracker.ietf.org/doc/draft-lou-rtgwg-sinc/

Introduction:

This memo introduces “Signaling In-Network Computing operations” (SINC), a mechanism for signaling in-network computing operations on data packets in specific scenarios such as NetReduce, NetDistributedLock, and NetSequencer. In particular, this solution allows computational parameters, to be used in conjunction with the payload, to be communicated flexibly to in-network SINC-enabled devices so that they can perform computing operations.