X-wing - largest single-GPU CFD simulation ever done (AMD MI250)
Watch out, Skywalker! The lasers on your X-wing create turbulence!
TIE fighter simulation:
FluidX3D source code:
This is the largest CFD simulation ever done on a single GPU, cracking 10⁹ LBM grid points with FluidX3D on the mighty AMD Instinct MI250 (only on 1 GCD with 64GB*).
Simulating 50k time steps took 104 minutes at 1076×2152×538 resolution, plus 40 minutes for rendering 4x 30s 1080p videos. Shown are the Q-criterion isosurfaces, extracted with marching-cubes. The Reynolds number is 100k with the Smagorinsky subgrid model, but at this resolution the simulation would probably also run stable without it.
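For context, the Q-criterion is a scalar derived from the velocity-gradient tensor; vortex cores show up where it is positive, and marching-cubes then extracts an isosurface at some positive threshold. A minimal C++ sketch of just the formula (not FluidX3D's actual implementation; in practice the gradient would come e.g. from finite differences of the velocity field):

#include <cstdio>

// Q-criterion from the velocity-gradient tensor J[i][j] = du_i/dx_j:
// Q = 0.5*(||Omega||^2 - ||S||^2) with S = 0.5*(J+J^T) the strain-rate tensor
// and Omega = 0.5*(J-J^T) the rotation tensor (squared Frobenius norms).
// Grid points with Q > 0 are rotation-dominated, i.e. vortex cores.
float q_criterion(const float J[3][3]) {
	float s2 = 0.0f, o2 = 0.0f;
	for(int i=0; i<3; i++) for(int j=0; j<3; j++) {
		const float S = 0.5f*(J[i][j]+J[j][i]); // symmetric part
		const float O = 0.5f*(J[i][j]-J[j][i]); // antisymmetric part
		s2 += S*S;
		o2 += O*O;
	}
	return 0.5f*(o2-s2);
}

int main() {
	const float J[3][3] = {{0.0f,-1.0f,0.0f},{1.0f,0.0f,0.0f},{0.0f,0.0f,0.0f}}; // toy example: solid-body rotation around z
	printf("Q = %g\n", q_criterion(J)); // pure rotation -> Q = 1 > 0
	return 0;
}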
*The MI250 is actually 2 entirely separate GPUs (GCDs) in a single socket with 64GB memory each. One GCD can’t directly access the memory of the other. This simulation is only running on one GCD, using the full 64GB of its unified memory.
To use both GCDs, or better yet all 8 GCDs in the server at Jülich Supercomputing Center, the code would need to be extended specifically for multi-GPU support. This is very difficult and time-consuming, and for some parts of FluidX3D, like the raytracing graphics, it is close to impossible. Maybe one day I will find the time to do it, but not now.
Still, how is it possible to squeeze a billion grid points into only 64GB?
I’m using two techniques here, which together form the holy grail of lattice Boltzmann, cutting memory demand down to only 55 Bytes per grid point for D3Q19 LBM, about 1/3 of what conventional codes need:
1. In-place streaming with Esoteric-Pull. This almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries.
Paper:
2. Decoupled arithmetic precision (FP32) and memory precision (FP16): all arithmetic is done in FP32, but the LBM density distribution functions in memory are compressed to FP16. This almost cuts memory demand in half and almost doubles performance, without impacting overall accuracy for most setups (see the sketch after this list).
Paper:
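To put numbers on it: at 55 Bytes per grid point, the 1076×2152×538 grid (about 1.25×10⁹ points) needs roughly 68.5 GB ≈ 63.8 GiB and just barely fits into the 64GB of one GCD. The toy C++ sketch below illustrates only the second technique, the FP32/FP16 decoupling; the Esoteric-Pull indexing is left out, and the conversion helpers, equilibrium value and relaxation time are simplified stand-ins, not FluidX3D's actual OpenCL code (which can use the built-in half-precision load/store functions instead):

#include <cstdint>
#include <cstring>
#include <cstdio>
#include <vector>

// Minimal FP32<->FP16 conversion (no rounding, subnormals/NaN ignored), only to
// illustrate the idea of compressed storage with full-precision arithmetic.
uint16_t f32_to_f16(const float x) {
	uint32_t b; std::memcpy(&b, &x, 4u);
	const uint32_t sign = (b>>16)&0x8000u;
	const int32_t e = (int32_t)((b>>23)&0xFFu)-127+15; // re-bias exponent
	const uint32_t m = (b>>13)&0x03FFu; // keep the 10 most significant mantissa bits
	if(e<= 0) return (uint16_t)sign;           // underflow -> signed zero
	if(e>=31) return (uint16_t)(sign|0x7C00u); // overflow  -> infinity
	return (uint16_t)(sign|((uint32_t)e<<10)|m);
}
float f16_to_f32(const uint16_t h) {
	const uint32_t sign = (uint32_t)(h&0x8000u)<<16;
	const uint32_t e = (h>>10)&0x1Fu, m = h&0x03FFu;
	const uint32_t b = (e==0u) ? sign : sign|((e+127u-15u)<<23)|(m<<13);
	float x; std::memcpy(&x, &b, 4u);
	return x;
}

int main() {
	const size_t N = 8u; // tiny toy grid instead of 1.25 billion points
	const int Q = 19;    // D3Q19: 19 DDFs per grid point, stored as 2 Bytes each
	std::vector<uint16_t> ddf((size_t)Q*N, f32_to_f16(1.0f/19.0f)); // compressed DDF storage
	const float feq = 1.0f/19.0f, tau = 0.6f; // made-up equilibrium value and relaxation time
	for(size_t n=0u; n<N; n++) {
		for(int q=0; q<Q; q++) {
			float f = f16_to_f32(ddf[(size_t)q*N+n]); // decompress to FP32
			f += (feq-f)/tau;                         // all arithmetic in FP32 (toy BGK-style relaxation)
			ddf[(size_t)q*N+n] = f32_to_f16(f);       // compress back to FP16 for storage
		}
	}
	printf("f[0] = %g\n", f16_to_f32(ddf[0]));
	return 0;
}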
Graphics are done directly in FluidX3D with OpenCL, with the raw simulation data already residing in ultra-fast video memory. No volumetric data (one frame of the velocity field is 14GB!) ever has to be copied to the CPU or to the hard drive; only rendered 1080p frames (8MB each) are transferred instead. Once on the CPU side, a copy of the frame is made in memory and a thread is detached to handle the slow .png compression, all while the simulation is already continuing. At any time, about 16 frames are being compressed in parallel on 16 CPU cores while the simulation keeps running on the GPU.
Paper:
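A rough C++ sketch of that hand-off pattern (the function names and the raw-byte file dump are placeholders, not FluidX3D's actual code; a real encoder would write .png): the host copies the rendered frame once more and detaches a worker thread for the slow compression, so the GPU never has to wait.

#include <cstdint>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>
#include <chrono>

// Placeholder for the slow compression step: the frame is just dumped as raw
// bytes here; a real implementation would run a .png encoder instead.
void write_frame(const std::string& path, const std::vector<uint8_t>& rgba) {
	FILE* const file = fopen(path.c_str(), "wb");
	if(file) {
		fwrite(rgba.data(), 1u, rgba.size(), file);
		fclose(file);
	}
}

// Called after a GPU-rendered 1080p frame (~8MB) has been copied to host memory:
// copy it once more into the worker's own buffer, then detach a thread so the
// simulation can continue on the GPU right away.
void save_frame_async(const std::vector<uint8_t>& frame, const uint32_t frame_id) {
	std::vector<uint8_t> copy = frame; // the original frame buffer can be reused immediately
	std::thread([buffer=std::move(copy), frame_id]() {
		write_frame("frame_"+std::to_string(frame_id)+".raw", buffer);
	}).detach(); // fire-and-forget: ~16 such threads can run concurrently on a 16-core CPU
}

int main() {
	const std::vector<uint8_t> frame(1920u*1080u*4u, 127u); // dummy 1080p RGBA frame (~8MB)
	for(uint32_t i=0u; i<4u; i++) save_frame_async(frame, i); // the simulation loop would go here
	std::this_thread::sleep_for(std::chrono::seconds(2)); // toy main only: wait for detached threads
	return 0;
}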
Timestamps:
0:00 bottom up view
0:30 follow view
0:59 side view
1:29 top down view
#CFD #GPU #FluidX3D #OpenCL