I recently got to play with a GPU server at our lab and wanted to measure its performance on a typical deep learning workload, so I ran a simple benchmark with the following setup. The server itself is equipped with the hardware described in Table 1. For the experiment, I generated a synthetic dataset of 20000 images of dimensions 120×120×3. Next, I created a deep feedforward neural network in TensorFlow with a varying number of layers and 2000 units in each layer. Finally, I measured the run-time to train 5 epochs on the dataset, varying both the depth of the network and the hardware setup between 1 CPU, 16 CPUs, 1 GPU, and 2 GPUs.
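To make the setup concrete, here is a minimal sketch of how the synthetic dataset and the variable-depth feedforward network could be constructed. It uses NumPy and the Keras API in TensorFlow 2; the function names (`make_dataset`, `build_model`) and the exact API calls are my assumptions for illustration, not the benchmark code itself.

```python
import numpy as np
import tensorflow as tf

NUM_IMAGES = 20000
IMG_SHAPE = (120, 120, 3)
NUM_CLASSES = 10
UNITS_PER_LAYER = 2000

def make_dataset():
    """Generate a synthetic dataset of random images and random labels."""
    x = np.random.rand(NUM_IMAGES, *IMG_SHAPE).astype(np.float32)
    y = np.random.randint(0, NUM_CLASSES, size=NUM_IMAGES)
    return x, y

def build_model(num_hidden_layers):
    """Feedforward network: flatten the image, then `num_hidden_layers`
    dense ReLU layers with 2000 units each, and a softmax output layer."""
    layers = [tf.keras.layers.Flatten(input_shape=IMG_SHAPE)]
    layers += [tf.keras.layers.Dense(UNITS_PER_LAYER, activation="relu")
               for _ in range(num_hidden_layers)]
    layers += [tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")]
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```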
Hardware Setup (Table 1)

| Metric | Value |
| --- | --- |
| RAM | 128 GB |
| CPU | 16 Intel Xeon 3.50 GHz CPUs |
| GPU | 2 Tesla P100 GPUs |
| Storage | 2 TB SSD |
| GPU-to-GPU Communication | PCIe |
Experiment Parameters (Table 2)

| Metric | Value |
| --- | --- |
| Batch Size | 2056 |
| Dataset Size | X ∈ ℝ^(20000×120×120×3) |
| Number of Classes | 10 |
| Neurons Per Layer | 2000 |
| Steps Per Epoch | 30 |
| Number of Epochs | 5 |
| Neural Network Architecture | Feedforward Neural Network |
| Hidden Unit Activation | ReLU |
| Output Unit Activation | Softmax |
| Loss Function | Cross-Entropy |
| Multi-GPU Training | Data-Parallel with Collective All-Reduce (sketched below) |
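For the multi-GPU runs, Table 2 lists data-parallel training with collective all-reduce. Below is a hedged sketch of what that could look like with TensorFlow's distribution strategies: in TensorFlow 2 the corresponding strategy is `tf.distribute.MirroredStrategy`, which replicates the model on each GPU and aggregates gradients with an all-reduce, whereas the original benchmark may have used the older TF 1.x collective all-reduce strategy. The batch size and number of epochs come from Table 2; the depth and everything else here are illustrative assumptions.

```python
import tensorflow as tf

# Data-parallel training: the model is replicated on each visible GPU,
# every replica processes a slice of each batch, and gradients are
# aggregated with an all-reduce before the weight update.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables must be created inside the strategy scope so they are
    # mirrored on every GPU. build_model() is the sketch from above.
    model = build_model(num_hidden_layers=4)  # depth 4 is illustrative

x, y = make_dataset()                          # synthetic data, as sketched above
model.fit(x, y, batch_size=2056, epochs=5)     # batch size and epochs from Table 2
```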
Results (Table 1 & Fig 1)
The results demonstrate that the GPUs scale much better than the CPU as the neural network grows deeper, and are an order of magnitude faster on all metrics. Of course, a more thorough benchmark would also vary parameters such as the number of epochs and the batch size, which I did not do here. Moreover, neither the GPUs nor the CPUs were “warmed up” before the measurements, which could have had some effect on the results. Finally, I did not tune any of TensorFlow’s default configurations, nor did I use any of the optimizations that TensorFlow provides for multi-GPU training. Multi-GPU training did not show any significant performance boost in this case, which might be due to a data-transfer bottleneck, a batch size that was set too low, or a number of epochs that was too low. The code for the benchmarks is available here.
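For completeness, here is a sketch of how the run-times across the hardware configurations could be measured: GPU visibility can be controlled with the `CUDA_VISIBLE_DEVICES` environment variable, the single-core CPU run can be approximated by limiting TensorFlow's thread pools via `tf.config.threading`, and the timing itself is just wall-clock time around `model.fit`. This is my assumption of a reasonable measurement harness, not the actual benchmark code linked above, and the depths in the loop are hypothetical.

```python
import os
import time

# Choose which GPUs TensorFlow may use *before* importing it:
# "" = CPU only, "0" = a single GPU, "0,1" = both GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf

# For the single-core CPU run, the thread pools can be limited
# (must be called before TensorFlow creates any ops):
# tf.config.threading.set_intra_op_parallelism_threads(1)
# tf.config.threading.set_inter_op_parallelism_threads(1)

def time_training(num_hidden_layers):
    """Wall-clock time (seconds) to train 5 epochs at a given depth."""
    model = build_model(num_hidden_layers)   # from the first sketch
    x, y = make_dataset()
    start = time.time()
    model.fit(x, y, batch_size=2056, epochs=5, verbose=0)
    return time.time() - start

for depth in [1, 2, 4, 8]:   # hypothetical depths; the benchmark varied this
    print(f"{depth} hidden layers: {time_training(depth):.1f} s")
```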