A GPU optimized fast Bloch simulator was developed for arbitrary pulse sequence inputs. The simulator was applied to multi-slice imaging using 256×256×512 calculation matrix. As a result, we found that the number of short pulses used to approximate the selective excitation pulse and the number of subvoxels used for imaging in the cross-sectional plane were essential to simulation speed.
Introduction
Recent rapid development of graphical processor units (GPUs) has enabled fast Bloch simulation using small computer systems. Because single-precision floating-point number calculation is sufficient for most Bloch simulation [1], the cost per TFLOPS for Bloch simulators is now less than $100. Therefore, full matrix simulation of MRI pulse sequences has become realistic even in small computer systems. In this study, we developed a fast Bloch simulator using two GPU boards for arbitrary MRI pulse sequences and compared the results with experiments.Materials and Methods
We developed a GPU optimized Bloch simulator using two GPU boards (GeForce GTX1080): one GTX1080 has 2560 CUDA cores, 8 GB memory, 1607 MHz clock frequency, and 8.9 TFLOPS theoretical processing speed for single-precision floating-point numbers. Figure 1(a) shows the dialog box of our simulator, whose input parameters are categorized into three classes. The first is pulse sequence files, which are selected and stacked in the child window to be serially executed. The second is gradient coil parameters. The third is information related to the image matrix, which includes matrix size, FOV, files for the numerical phantom, and files for magnetic field distribution (B0, B1 transmission, B1 reception, and field gradients). Figure 1(b) is a typical pulse sequence file for a 3D RF spoiled gradient echo (SPGR) sequence that can be used for both experiments and simulation. After the pulse sequence was read from the sequence file, it was automatically converted to CUDA source codes and compiled to GPU execution codes using the NVRTC [1]. Gradient-echo multi-slice sequences (TR/TE = 1200ms/6ms, time between excitation pulses = 100 ms, excitation pulse: sinc(x), -4π<x<4π, 8 kHz bandwidth) were used for simulation and experiments with a multi-slice phantom. The phantom consisted of a triangular acrylic block (70mm×25mm, 20mm thick) and nine acrylic round bar (diameter = 5 mm) stored in an acrylic cylindrical container (inner diameter = 53 mm, length = 84 mm) filled with CuSO4 water solution (T1 = T2 = 120 ms). The slice thickness and slice interval were both 5 mm. Imaging experiments were performed using a home-built 1.5 T MRI system utilizing a horizontal bore (280 mm) superconducting magnet and a digital MRI console.Results
Figure 2 shows the MR signal display window for real-time monitoring the process of the Bloch simulation. Figures 2(a) and (b) show MR signal for a gradient-echo multi-slice sequence and that for a 3D-SPGR sequence. The overhead time for the MR signal display was negligible comparing with the simulation time. Figure 3 shows 2D cross-sectional images acquired with the gradient-echo multi-slice sequence calculated using short pulse approximation of the selective-excitation pulse. The matrix size for the calculation was 256×256×512 with two subvoxels along the readout direction. The number of the short square pulses was 100 and 200 for (a) and (b), and the calculation time was 190 and 472 seconds for (a) and (b). While ghost images were observed when 100 pulses were used, no ghost image was observed when 200 pulses were used. This result clearly demonstrated that the 100 short pulses cannot describe the motion of the magnetization accurately when the offset frequency is more than about ±30 kHz. Figure 4 shows cross-sectional images acquired with the gradient-echo multi-slice sequence when the number of subvoxel was changed from 1 to 8. The matrix size for the calculation was 128×128×512. While considerable artifacts were observed when the number of subvoxel was less than 2, the artifacts disappeared when it was more than 4. Figure 5 shows comparison between (a) experiments and (b) simulation. B1 distribution experimentally measured beforehand was used for RF transmission and signal reception. The B1 distribution was rotated from the experiment to the simulation because the birdcage coil was actually rotated.Discussion
In Bloch simulation, there are two massive calculations: one is summation of magnetization vector over all voxels to acquire MR signal, and the other is calculation of RF excitation in inhomogeneous magnetic fields such as slicing gradients. In our Bloch simulator, the former problem was solved using the parallel processing capability of the GPU and the latter problem was solved using Rodrigues' rotation formula, which simplified several rotation operations to one matrix operation. In the present study, it was found that the number of short square pulses used to approximate the selective excitation pulse and the number of subvoxel to calculate MR images in the imaging plane were very important factors to accelerate the simulation speed. In conclusion, our GPU optimized Bloch simulator has sufficient capability to be applied to multi-slice imaging, which is clinically important and useful.