X-wing - largest single-GPU CFD simulation ever done (AMD MI250)
Watch out, Skywalker! The lasers on your X-wing create turbulence!
TIE fighter simulation:
FluidX3D source code:
This is the largest CFD simulation ever done on a single GPU, cracking 1.25×10⁹ LBM grid points with FluidX3D on the mighty AMD Instinct MI250 (only on 1 GCD with 64GB*).
Simulating 50k time steps took 104 minutes at 1076×2152×538 resolution, plus 40 minutes for rendering 4× 30s 1080p videos. Shown are the Q-criterion isosurfaces, extracted with marching-cubes. The Reynolds number is 100k with the Smagorinsky subgrid model, but at this resolution it would probably also run stably without it.
*The MI250 is actually 2 entirely separate GPUs (GCDs) in a single socket with 64GB memory each. One GCD can’t directly access the memory of the other. This simulation is only running on one GCD, using the full 64GB of its unified memory.
To use both GCDs, or better all 8 GCDs in the server at Jülich Supercomputing Center, the code would need to be specifically extended for multi-GPU support. This is very difficult and time-consuming, and for some parts of FluidX3D, like raytracing graphics, it is close to impossible. Maybe one day I will find the time to do it, but not now.
Still, how is it possible to squeeze 1.25 billion grid points into only 64GB?
I'm using two techniques here, which together form the holy grail of lattice Boltzmann, cutting memory demand down to only 55 Bytes/node for D3Q19 LBM, or about 1/3 of what conventional codes require:
1. In-place streaming with Esoteric-Pull. This almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries.
Paper:
2. Decoupled arithmetic precision (FP32) and memory precision (FP16): all arithmetic is done in FP32, but LBM density distribution functions in memory are compressed to FP16. This almost cuts memory demand in half and almost doubles performance, without impacting overall accuracy for most setups.
Paper:
Graphics are done directly in FluidX3D with OpenCL, with the raw simulation data already residing in ultra-fast video memory. No volumetric data (1 frame of the velocity field is 14GB!) ever has to be copied to the CPU or hard drive; only the rendered 1080p frames (8MB each) are. Once on the CPU side, a copy of the frame is made in memory and a thread is detached to handle the slow .png compression, all while the simulation is already continuing. At any time, about 16 frames are compressed in parallel on 16 CPU cores, while the simulation keeps running on the GPU.
Paper:
Timestamps:
0:00 bottom up view
0:30 follow view
0:59 side view
1:29 top down view
#CFD #GPU #FluidX3D #OpenCL