



## Petascale Photonic Connectivity for Energy Efficient AI Computing

#### Keren Bergman

Department of Electrical Engineering, Columbia University, New York, NY









#### Intro – Keren Bergman

- BS EE Bucknell Univ; MS and PhD EE MIT
- Professor and faculty director CNI, Columbia University
- DOE ASCR; DARPA Exascale Computing Initiative
- ISC 2022 Technical Chair
- Silicon photonics 300mm foundry, Leadership Council
- Lead PI: ARPA-E ENLITENED Program; DARPA ERI PIPES
- Director, SRC JUMP 2.0 Center for Ubiquitous Connectivity (CUbiC) – DARPA, 15 industry partners
- Fellow IEEE, Optica





#### **AI Applications Driving Ever Larger Models for Deep Learning**



Model sizes increased <u>> 6 orders of magnitude</u> in <u>6 years</u>

Over 10 trillion parameters: exceeds the memory capacity of any single computing unit





### **Per-Training Energy Consumption**





#### **ML Training – Workloads Energy Consumption**



\*State-of-the-art neural architecture search, trained on 8 NVIDIA P100 GPUs (1,515 W), ~ 656,000 kWh [see arXiv:1906.02243 for full assumptions]

Adapted from E. Strubell, A. Ganesh, A. McCallum, *arXiv:1906.02243* (2019)
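The footnote's ~656,000 kWh figure can be roughly reproduced from the stated 1,515 W draw. This is a minimal sketch: the training time and the power usage effectiveness (PUE) factor are not on the slide and are assumed here from the cited arXiv paper, so check them against that source before reusing them.

```python
# Rough reproduction of the ~656,000 kWh training-energy footnote.
POWER_W = 1515          # 8x NVIDIA P100 system draw, from the slide
# ASSUMPTIONS (taken from arXiv:1906.02243, not stated on the slide):
TRAIN_HOURS = 274_120   # assumed training time in hours
PUE = 1.58              # assumed data-center power usage effectiveness


def training_energy_kwh(power_w, hours, pue):
    """Energy in kWh: facility overhead (PUE) x hours x power in kW."""
    return pue * hours * power_w / 1000.0


energy_kwh = training_energy_kwh(POWER_W, TRAIN_HOURS, PUE)
print(f"Estimated training energy: {energy_kwh:,.0f} kWh")
```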



#### **Distributed Deep Learning: Communications Bottleneck**





# **Bringing Photonics to the Chip**



Adapted from Gordon Keeler, DARPA



intel.

## **Silicon Photonics Fabrication**

GlobalFoundries<sup>™</sup>





#### Photonics = <u>Massive</u> Parallelism in the Wavelength Domain

Frequency Combs: Multi-Tb/s per Single Link





# Approach to reaching multi-Tbps IO and sub-pJ/b

#### **Key Technical Innovations:**

- Embrace <u>extreme parallelism</u>:
  - Ultra-dense channels generated by >250 wavelengths (WDM) comb source
  - Each wavelength channel modulated at modest data rates for minimizing energy consumption
  - SERDES-*less* operation; energy/bandwidth density co-optimization
- Scalable link architecture:
  - Co-design with broadband comb source
  - Multi-FSR operation regime
- Reduction of thermal energy consumption:
  - Photonics *robust* to fabrication variations
  - Engineered for athermal operation
  - Wafer scale undercut for increased efficiency
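The "extreme parallelism" argument is simple arithmetic: many wavelengths, each at a modest rate, add up to a multi-Tb/s link. The sketch below illustrates this under one assumption: the 10 Gb/s per-wavelength rate is a hypothetical value chosen for illustration, since this slide only says "modest data rates".

```python
import math

# From the slide: more than 250 WDM channels from one comb source.
N_WAVELENGTHS = 250
# ASSUMPTION (not on this slide): a "modest" per-wavelength rate of 10 Gb/s.
RATE_PER_CHANNEL_GBPS = 10.0


def aggregate_link_rate_tbps(n_wavelengths, rate_per_channel_gbps):
    """Total link rate when every wavelength carries the same data rate."""
    return n_wavelengths * rate_per_channel_gbps / 1000.0


def channels_needed(target_tbps, rate_per_channel_gbps):
    """Smallest number of wavelengths that reaches a target link rate."""
    return math.ceil(target_tbps * 1000.0 / rate_per_channel_gbps)


link_tbps = aggregate_link_rate_tbps(N_WAVELENGTHS, RATE_PER_CHANNEL_GBPS)
print(f"{N_WAVELENGTHS} channels x {RATE_PER_CHANNEL_GBPS} Gb/s = {link_tbps} Tb/s")
print("Channels for 1 Tb/s:", channels_needed(1.0, RATE_PER_CHANNEL_GBPS))
```

Keeping the per-channel rate low is what allows SERDES-less operation; the bandwidth then scales with channel count rather than with faster electronics.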










#### **Scalable Photonic Link Architecture**







#### 2.5D High-Density Packaging – Enables Systems Exploration



# COLUMBIA UNIVERSITY



## **3D Integration to Realize Bandwidth Density**





- Transceiver: 600 µm × 415 µm ≈ 0.25 mm<sup>2</sup>
- Interleavers: 490 µm × 310 µm ≈ 0.15 mm<sup>2</sup>
- Bandwidth density: $2\,\mathrm{Tbps} / 0.4\,\mathrm{mm}^2 = 5\,\mathrm{Tbps/mm}^2$
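The area and bandwidth-density figures above can be checked with a few lines of arithmetic. This is only a sketch of the slide's calculation; the function and variable names are illustrative.

```python
def area_mm2(width_um, height_um):
    """Rectangle area in mm^2 from dimensions given in micrometers."""
    return width_um * height_um / 1e6


transceiver_mm2 = area_mm2(600, 415)    # slide rounds this to 0.25 mm^2
interleaver_mm2 = area_mm2(490, 310)    # slide rounds this to 0.15 mm^2
total_mm2 = transceiver_mm2 + interleaver_mm2

LINK_RATE_TBPS = 2.0                    # aggregate rate quoted on the slide
density_tbps_per_mm2 = LINK_RATE_TBPS / total_mm2

print(f"Total area: {total_mm2:.3f} mm^2")
print(f"Bandwidth density: {density_tbps_per_mm2:.2f} Tbps/mm^2")
```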





MCM Packaging





### **3D EIC/PIC Heterogeneous Integration**



Copper Pillar Bumped EIC





S. Daudlin, A. Rizzo, ..., A. Molnar, K. Bergman, Optical Fiber Communication Conference (OFC) 2021

#### Full 300 mm Custom Wafer Cedar



#### **3D Bumped PIC Wafer + EIC**







#### Fully Assembled High Density 5.3 Tb/s/mm<sup>2</sup> MCM









#### 800 Gbps at 50 fJ/bit Dense Transmitter Array







**Realizing 50 fJ/bit Transmitter; BER = 10E-12 and 1Vpp** 



S. Daudlin, S. Lee, D. Khilwani, C. Ou, A. Rizzo, S. Wang, M. Cullen, A. Molnar, and K. Bergman, "Ultra-dense 3D integrated 5.3 Tb/s/mm<sup>2</sup> 80 micro-disk modulator transmitter" in OFC 2023, paper M3I.1.
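Energy per bit converts directly to electrical power at a given data rate. The sketch below works this out for the 800 Gbps, 50 fJ/bit transmitter. It assumes the 800 Gbps is shared equally across the 80 micro-disk modulators named in the paper title, which the slides do not state explicitly.

```python
def power_mw(rate_gbps, energy_fj_per_bit):
    """Power in mW: (Gb/s) x (fJ/bit) = 1e9 x 1e-15 W = 1e-3 mW per unit."""
    return rate_gbps * energy_fj_per_bit / 1000.0


TX_RATE_GBPS = 800          # from the slide heading
TX_ENERGY_FJ_PER_BIT = 50   # from the slide heading
N_MODULATORS = 80           # from the OFC 2023 paper title

tx_power_mw = power_mw(TX_RATE_GBPS, TX_ENERGY_FJ_PER_BIT)
# ASSUMPTION: the aggregate rate is split evenly over all modulators.
rate_per_channel_gbps = TX_RATE_GBPS / N_MODULATORS

print(f"Transmitter array power: {tx_power_mw} mW")
print(f"Per-channel rate: {rate_per_channel_gbps} Gb/s")
```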





#### 68 fJ/bit and -24.85 dBm Sensitivity 800 Gbps Receiver Array







#### **Fully Packaged MCM with Fiber Array**

✓ Complete packaging of 3-D integrated MCM with wire-bonding and SMF28 fiber array attach







#### MCM TX to MCM RX over 100 meters



- ✓ MCM2 transmitting signal to separate MCM1 receiver
- ✓ No amplifiers
- ✓ 100 meters
- ✓ 8 Gbps / channel
- ✓ -6 dBm laser power
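One way to read the amplifier-free result is as an optical power budget: the gap between the launched laser power and the receiver sensitivity is the loss the link can tolerate. The sketch below combines the -6 dBm laser power with the -24.85 dBm receiver sensitivity quoted earlier. Treating both as the same per-channel quantity is an assumption, since the slides do not say how the laser power is divided among channels.

```python
def dbm_to_mw(dbm):
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (dbm / 10)


LASER_POWER_DBM = -6.0        # from this slide
RX_SENSITIVITY_DBM = -24.85   # from the receiver array slide

# ASSUMPTION: both numbers refer to the same per-channel optical power.
loss_budget_db = LASER_POWER_DBM - RX_SENSITIVITY_DBM

print(f"Laser power: {dbm_to_mw(LASER_POWER_DBM):.3f} mW")
print(f"Receiver sensitivity: {dbm_to_mw(RX_SENSITIVITY_DBM) * 1000:.2f} uW")
print(f"Tolerable link loss: {loss_budget_db:.2f} dB")
```

Under that assumption, coupling, modulator, filter, and fiber losses together must stay within this budget for the link to close without amplification.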



#### **Need for Data Movement is Growing**

Aggregated IO and Memory BW per Socket







#### **Embedded Photonics – Scaling Ultra-low Energy Memory BW**

Samsung Flashbolt HBM<sup>+</sup>

- Capacity: 16 GB/stack
- Memory BW: ~400 GB/s/stack
- Memory BW/capacity ratio: 25x
- Footprint: 10 mm × 11 mm = 110 mm<sup>2</sup>

#### Scaling HBM over full interposer:

- ~1000 mm<sup>2</sup> with 9 stacks
- 144 GB per package with current HBM
- Applying the 25x memory BW/capacity ratio: 144 GB × 25 ≈ 3.6 TB/s (~4 TB/s)
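The interposer-scaling estimate follows directly from the per-stack numbers. The sketch below repeats the arithmetic using only values from the slide; the variable names are illustrative.

```python
# Per-stack figures from the slide (Samsung Flashbolt HBM).
STACK_CAPACITY_GB = 16
STACK_BW_GB_PER_S = 400
STACK_AREA_MM2 = 10 * 11
N_STACKS = 9

# Bandwidth-to-capacity ratio, in (GB/s) per GB.
bw_per_capacity = STACK_BW_GB_PER_S / STACK_CAPACITY_GB

package_area_mm2 = N_STACKS * STACK_AREA_MM2
package_capacity_gb = N_STACKS * STACK_CAPACITY_GB
package_bw_tb_per_s = package_capacity_gb * bw_per_capacity / 1000.0

print(f"BW/capacity ratio: {bw_per_capacity:.0f}x")
print(f"Package area: {package_area_mm2} mm^2")
print(f"Package capacity: {package_capacity_gb} GB")
print(f"Package memory BW: {package_bw_tb_per_s:.1f} TB/s")
```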













#### Embedded Photonics – <u>Flexible</u> Interconnect Fabric







GPU-GPU Interconnect

- NVIDIA DGX A100: 6 NVSwitches
- 8-GPU DGX system aggregate BW: 4.8 TB/s
- Each NVSwitch: ~800 GB/s, ~20 GB/s/mm<sup>2</sup>

#### **Photonic Fabric:**

- Target **1.28 TB/s/mm<sup>2</sup>** in ~1000mm<sup>2</sup> substrate
- Photonic fabric aggregate BW: 1280 TB/s
- 128 ports × 10 TB/s/port
- **Flexible** Spatial/Wavelength/Mode *granularity*
- DARPA LUMOS on-chip gain for scaling
- Optical Multicasting + through on-chip NLO
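The fabric targets above are self-consistent and can be compared with the NVSwitch baseline using simple arithmetic. This sketch uses only numbers from the slide; the comparison ratios are derived here and are not claims made by the slide itself.

```python
# Electrical baseline from the slide (8-GPU DGX system).
DGX_AGGREGATE_TB_PER_S = 4.8
N_NVSWITCHES = 6
NVSWITCH_DENSITY_GB_PER_S_MM2 = 20.0

# Photonic fabric targets from the slide.
FABRIC_DENSITY_TB_PER_S_MM2 = 1.28
FABRIC_AREA_MM2 = 1000
N_PORTS = 128
PORT_BW_TB_PER_S = 10

per_switch_gb_per_s = DGX_AGGREGATE_TB_PER_S / N_NVSWITCHES * 1000.0
fabric_bw_from_density = FABRIC_DENSITY_TB_PER_S_MM2 * FABRIC_AREA_MM2
fabric_bw_from_ports = N_PORTS * PORT_BW_TB_PER_S

# Derived comparisons (not stated on the slide).
aggregate_gain = fabric_bw_from_ports / DGX_AGGREGATE_TB_PER_S
density_gain = FABRIC_DENSITY_TB_PER_S_MM2 * 1000.0 / NVSWITCH_DENSITY_GB_PER_S_MM2

print(f"Per-NVSwitch BW: {per_switch_gb_per_s:.0f} GB/s")
print(f"Fabric BW (density x area): {fabric_bw_from_density:.0f} TB/s")
print(f"Fabric BW (ports x port BW): {fabric_bw_from_ports} TB/s")
print(f"Aggregate BW gain: {aggregate_gain:.0f}x, density gain: {density_gain:.0f}x")
```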







