Parallel transposing and communication strategies for FFT on cluster of SMP architectures with multicore processors
This paper presents a high-performance parallel formulation of the 1-D FFT based on the transpose algorithm. The parallel scheme trades some efficiency for a more consistent level of parallel performance. Because the scheme relies on matrix transposition, a new in-place transposition algorithm is introduced into the parallel matrix transposition to improve its efficiency. Depending on the input size n, the number of processes p, and the memory or network bandwidth, this method can achieve better parallel performance than alternative methods on clusters of SMP architectures. Tests show that our scheme achieves good speedup.
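
To make the structure of a transpose-based 1-D FFT concrete, the following is a minimal serial sketch in Python, assuming the input length factors as n = n1 * n2. It is the textbook six-step formulation, not the implementation evaluated in this paper: the names `dft`, `transpose`, and `transpose_fft` are hypothetical, the naive `dft` stands in for a tuned local FFT routine, and the out-of-place `transpose` stands in for (and is not) the in-place transposition algorithm introduced here. In a distributed setting, each of the three transposes would typically correspond to an all-to-all exchange among the p processes, while the row transforms stay local.

```python
import cmath


def dft(row):
    """Naive O(m^2) DFT of one row; a placeholder for a local FFT routine."""
    m = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / m) for j in range(m))
        for k in range(m)
    ]


def transpose(mat):
    """Out-of-place transpose of a list-of-lists matrix (illustrative only)."""
    return [list(col) for col in zip(*mat)]


def transpose_fft(x, n1, n2):
    """Six-step transpose FFT of x, where len(x) == n1 * n2.

    Input index  j = j1 + n1 * j2, output index k = k2 + n2 * k1.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # View x as an n2 x n1 row-major matrix: a[j2][j1] = x[j1 + n1 * j2].
    a = [x[r * n1:(r + 1) * n1] for r in range(n2)]

    # Step 1: transpose so each row holds one stride-n1 subsequence.
    b = transpose(a)  # b[j1][j2], shape n1 x n2

    # Step 2: n1 independent transforms of length n2 (local work).
    c = [dft(row) for row in b]  # c[j1][k2]

    # Step 3: multiply by the twiddle factors exp(-2*pi*i * j1 * k2 / n).
    c = [
        [c[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n) for k2 in range(n2)]
        for j1 in range(n1)
    ]

    # Step 4: transpose so the remaining transform runs along rows.
    d = transpose(c)  # d[k2][j1], shape n2 x n1

    # Step 5: n2 independent transforms of length n1 (local work).
    e = [dft(row) for row in d]  # e[k2][k1]

    # Step 6: final transpose restores natural output order k = k2 + n2 * k1.
    f = transpose(e)  # f[k1][k2], shape n1 x n2
    return [value for row in f for value in row]
```

For a 12-point input, `transpose_fft(x, 4, 3)` should agree with `dft(x)` up to floating-point rounding. The sketch also shows why transposition cost matters for this family of algorithms: three of the six steps only move data, so the efficiency of the transpose directly affects overall performance.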