August 14, 2019

How fast is the Intel DC Persistent Memory Module?

TL;DR: Slow, SSD-level.

For more details, check out this paper.

I only measured write performance, using the tool pqos-os by Intel.

System Configuration

Item                 Spec
CPU                  Intel® Xeon® Gold 6252 CPU @ 2.10GHz * 2
DRAM                 2666 MHz - 6 * 32 GB * 2
Intel DCPMM          2666 MHz - 4 * 128 GB * 2
Linux Distro/Kernel  Arch Linux - 5.2.7

Bandwidth (sequential write)

Item                Bandwidth
DRAM Local          75 GiB/s
DRAM Remote         27 GiB/s
Intel DCPMM Local   8.5 GiB/s
Intel DCPMM Remote  4.5 GiB/s

DRAM Bandwidth

DRAM Remote max bandwidth is about 1/3 of DRAM Local’s.

Below 10 threads, DRAM Remote bandwidth is about 1/2 of DRAM Local’s. Perhaps because UPI is used more efficiently?

Intel DCPMM Bandwidth

Three threads are enough to saturate Intel DCPMM local bandwidth, and the bandwidth decreases slightly with more threads.

With a small number of threads (<3), remote Intel DCPMM bandwidth is about 1/4 of local’s; the ratio gradually rises to 1/2 before 6 threads. After 7 threads the performance drops by 50%, and it keeps dropping until it converges to ~700 MiB/s (slow!!!).

The penalty of non-uniform PM access is much higher than that of non-uniform DRAM access.
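To put rough numbers on that penalty, here is a small sketch that only redoes the arithmetic on the figures quoted above. The helper names are made up for illustration, and the local DCPMM figure is the 8.5 GiB/s peak, even though local bandwidth dips slightly past three threads.

```c
#include <assert.h>
#include <stdio.h>

/* Remote bandwidth as a fraction of local bandwidth (both in the same unit). */
static double remote_ratio(double remote, double local) {
    assert(local > 0.0);
    return remote / local;
}

/* Print the remote/local ratios implied by the numbers in this post. */
static void print_numa_ratios(void) {
    /* Peak figures from the bandwidth table, in GiB/s. */
    double dram = remote_ratio(27.0, 75.0);
    double pm_peak = remote_ratio(4.5, 8.5);
    /* ~700 MiB/s at high thread counts, converted to GiB/s. */
    double pm_many_threads = remote_ratio(700.0 / 1024.0, 8.5);

    printf("DRAM remote/local (peak):          %.2f\n", dram);
    printf("DCPMM remote/local (peak):         %.2f\n", pm_peak);
    printf("DCPMM remote/local (many threads): %.2f\n", pm_many_threads);
}
```

By this arithmetic the gap shows up mostly at high thread counts: remote DRAM keeps roughly a third of local bandwidth, while remote DCPMM ends up below a tenth of its local peak.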

Optane Tricks (Dragon warning!)

The question is “How do you test memory write performance?”
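One simple way to approach that question is to time a write kernel over a large buffer and divide bytes by seconds. The sketch below is not how the numbers in this post were collected (those came from pqos-os); it is a minimal timing harness, and `write_fn`, `write_memset` and `write_bandwidth_gib` are hypothetical names. It assumes GCC or Clang on Linux.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for a write kernel under test. */
typedef void (*write_fn)(void *dst, size_t len);

/* Baseline kernel: the plain glibc memset variant. */
static void write_memset(void *dst, size_t len) {
    memset(dst, 1, len);
}

/* Run `rounds` passes of `fn` over a `len`-byte buffer and return GiB/s. */
static double write_bandwidth_gib(write_fn fn, void *dst, size_t len, int rounds) {
    struct timespec t0, t1;
    assert(len > 0 && rounds > 0);

    fn(dst, len);  /* warm-up pass so page faults are not timed */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++) {
        fn(dst, len);
        /* Compiler barrier (GCC/Clang) so repeated passes are not merged. */
        __asm__ __volatile__("" ::: "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)len * (double)rounds;
    return bytes / secs / (1024.0 * 1024.0 * 1024.0);
}
```

Calling `write_bandwidth_gib(write_memset, buf, len, rounds)` on a buffer much larger than the last-level cache gives a single-thread figure; any other write variant can be wrapped in a function with the same signature and passed in instead.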

Name              Impl.
glibc memset      memset(dst, 1, len)
with clwb         memset(dst, 1, len); _mm_clwb(dst)
with clflushopt   memset(dst, 1, len); _mm_clflushopt(dst)
with clflush      memset(dst, 1, len); _mm_clflush(dst)
avx512 tp         _mm512_store_si512((__m512i*)dst, c)
avx512 tp + clwb  _mm512_store_si512((__m512i*)dst, c); _mm_clwb(dst)
avx512 nt         _mm512_stream_si512((__m512i*)dst, c)
                  pmem_memset_nodrain(dst, 1, len)
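The table shows a single flush call after each memset. To cover a whole range, one flush per cache line is presumably needed, so here is a sketch of what the "memset + flush" variants might look like in full. This is an assumption, not the exact benchmark code. It uses `_mm_clflush` because that only needs SSE2; `_mm_clwb` and `_mm_clflushopt` would go in the same spot but need `-mclwb` / `-mclflushopt` and a CPU that supports them. `memset_flush` and `CACHE_LINE` are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (SSE2, x86 only) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed cache-line size on x86-64 */

/* Fill [dst, dst + len) with byte c, then flush every cache line touched. */
static void memset_flush(void *dst, int c, size_t len) {
    memset(dst, c, len);
    if (len == 0)
        return;

    /* Start at the cache line containing dst, stop past the last byte. */
    uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)dst + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);

    /* Fence so the flushes are ordered before later stores; this matters
       for the weakly ordered clflushopt/clwb variants. */
    _mm_sfence();
}
```

The flush only changes where the data lives, not what it contains, so the buffer reads back the same as after a plain memset.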

  1. Intel DCPMM performs best on sequential, read-only or write-only workloads [1].

  2. Application developers should always take control of the cache write back, instead of delegating it to the CPU cache controller.
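One way to take that control is to bypass the cache altogether with non-temporal stores, as in the "avx512 nt" variant. The sketch below swaps in the 128-bit SSE2 intrinsic `_mm_stream_si128` for `_mm512_stream_si512` so it runs on any x86-64 machine; the structure is the same. `memset_nt` is a hypothetical helper, and it assumes a 16-byte-aligned buffer whose length is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fill a 16-byte-aligned buffer with byte c using non-temporal stores. */
static void memset_nt(void *dst, char c, size_t len) {
    assert((uintptr_t)dst % 16 == 0);  /* streaming stores need alignment */
    assert(len % 16 == 0);             /* no tail handling in this sketch */

    __m128i v = _mm_set1_epi8(c);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, v);    /* write around the cache */

    /* Non-temporal stores are weakly ordered; fence before relying on them. */
    _mm_sfence();
}
```

With this approach the application decides when data leaves the CPU, instead of waiting for the cache controller to evict dirty lines in whatever order it likes.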

