LRDIMM latency vs. DDR4

UPDATE: 06/24/2012 – Invensas on LRDIMM design inferiority vs. HyperCloud
UPDATE: 07/27/2012 – confirmed HCDIMM similar latency as RDIMMs
UPDATE: 07/27/2012 – confirmed LRDIMM latency and throughput weakness

RDIMMs have a 1 clock cycle latency penalty compared to UDIMMs (unbuffered DIMMs).

LRDIMMs have a “5 ns latency penalty” compared to RDIMMs (from Inphi LRDIMM blog).

HyperCloud has similar latency to RDIMMs and a rather significant “4 clock latency improvement” over the LRDIMM (quote from the Netlist Craig-Hallum conference – October 6, 2011).

LRDIMM latency penalty

LRDIMMs have a “5 ns latency penalty” compared to RDIMMs (from Inphi LRDIMM blog):

http://lrdimmblog.inphi.com/lrdimm-has-lower-latency-than-rdimm.php
LRDIMM has Lower Latency than RDIMM!
By David Wang on 08-09-2011 at 5:06 PM

As described previously in other posts and in the whitepaper on the LRDIMM blog site, the buffering and re-driving of the data signals enable the LRDIMM to support more DRAM devices on the memory module, and for the entire memory module to operate at higher data rates.

The key to the LRDIMM-has-lower-latency-than-RDIMM claim lies in the fact that an LRDIMM memory system can operate at higher data rates than the comparably configured RDIMM memory system. Consequently, a higher data rate LRDIMM-based memory system can overcome the latency burden of having to buffer and re-drive the signals, and attain lower access latency compared to a lower data rate RDIMM-based memory system.

It shows that when operating at the same data rate, the Quad-rank LRDIMM has approximately 5 ns longer latency than the Quad-rank RDIMM. However, it also shows that the random access latency of both the LRDIMM and RDIMM memory systems decreases with increasing data rate. Consequently, when the highest-speed-bin RDIMM memory system, operating at 1066 MT/s, is compared to an LRDIMM memory system operating at 1333 MT/s, the LRDIMM memory system operating at 1333 MT/s is shown to have the lowest access latency compared to an RDIMM memory system.

In essence, a higher data rate LRDIMM memory system enables memory requests to traverse through the memory controller at a higher frequency, enjoying great low latency benefits, overcoming the latency overhead of the data buffering on the LRDIMM, and resulting in net reduction in access latency compared to an RDIMM memory system that operates at its maximum (1066 MT/s) frequency.

The Inphi blog post above claims “lower latency for LRDIMMs”, when in fact the numbers it quotes confirm that LRDIMMs have a “5 ns longer latency” than RDIMMs.

What the article then does is mix up overall performance with “latency”. The latency is not improved on the LRDIMMs; it is simply that when you compare an LRDIMM operating at 1333MHz with a SLOWER-running RDIMM at 1066MHz, the LRDIMM is of course going to give better scores on the Sandra benchmarks.

While this may be a good comparison when discussing overall performance, it is misleading to say that it somehow reduces the “latency”.

In addition, what the article fails to point out is that a constant “5 ns longer latency” (due to the asymmetrical lines and centralized buffer chipset on the LRDIMM) translates into a higher latency in clock-cycle terms as you go to higher clock frequencies, because the clock period is shorter and more clock cycles fit into that same 5 ns.

This suggests that the latency discrepancy (in terms of clock cycles) gets worse for LRDIMMs vs. HyperCloud as you move to higher frequencies.

And at 1333MHz, the latency discrepancy in terms of clock cycles between LRDIMMs and HyperCloud will be much worse than at 1066MHz.

LRDIMMs only do 1066MHz at 3 DPC on Romley servers. However, when the same LRDIMMs are run at 1333MHz at 3 DPC (with the help of a BIOS tweak, as on the IBM System X x3750 M4 servers), the latency in terms of clock cycles may look even worse when compared with RDIMMs or HyperCloud.
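To make the clock-cycle argument concrete, here is a back-of-envelope sketch (in Python) that converts the 5 ns penalty quoted by Inphi into clock cycles at the two data rates discussed above. It assumes the penalty is constant in time and that the DDR3 I/O clock runs at half the MT/s transfer rate; these assumptions are mine, not from the Inphi post.

```python
# Back-of-envelope: a fixed 5 ns latency penalty expressed in clock cycles
# at different DDR3 data rates. Assumes the penalty is constant in time and
# that the I/O clock runs at half the MT/s transfer rate (DDR signaling).

PENALTY_NS = 5.0  # "5 ns latency penalty" quoted by the Inphi LRDIMM blog

for data_rate_mts in (1066, 1333):
    clock_mhz = data_rate_mts / 2.0   # DDR: two transfers per clock
    period_ns = 1000.0 / clock_mhz    # clock period in ns
    cycles = PENALTY_NS / period_ns   # penalty expressed in clock cycles
    print(f"DDR3-{data_rate_mts}: clock {clock_mhz:.0f} MHz, "
          f"period {period_ns:.2f} ns, 5 ns penalty ~ {cycles:.1f} clocks")
```

Under these assumptions the same 5 ns costs roughly 2.7 clocks at 1066 MT/s but about 3.3 clocks at 1333 MT/s, which is the sense in which the cycle-count discrepancy widens at higher speeds.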

Cisco UCS latency penalty

Cisco UCS has a “6 ns latency penalty” compared to RDIMMs (I read this figure on a blog – not sure of its accuracy).

This is listed for comparison, because Cisco UCS has now switched away from its earlier ASIC-on-motherboard solution, which suggests Cisco is now going to use LRDIMM/HyperCloud-type solutions (ASIC-on-memory-module) for “load reduction”.

http://www.theregister.co.uk/2012/03/08/cisco_ucs_xeon_e5_servers/
Cisco outs third gen UCS blades and racks
California dreaming
By Timothy Prickett Morgan
Posted in Servers, 8th March 2012 16:48 GMT

The B200 M3 blade supports the Xeon E5-2600s with four, six, or eight processor cores and supports up to 384GB using regular, registered DDR3 memory sticks in 16GB capacities. The Cisco spec sheets do not say it supports LR-DIMM memory, but the presentation I have seen says the box does support 768GB, and that means 32GB sticks are coming – and for all the other vendors I have spoken to, getting to the full 768GB capacity has meant using LR-DIMMs.

What I can tell you is that Cisco has not used its own Nuova memory extension ASIC, used on some of the existing B Series blades and C Series rack servers, to boost memory capacity by as much as a factor of 2.7. Satinder Sethi, vice president of Cisco’s Server Access & Virtualization Technology Group, said that none of the three Xeon E5-2600 machines launched today use the Nuova memory-stretcher ASIC.

Here is an earlier article comparing the Cisco UCS approach (ASIC-on-motherboard) vs. Netlist’s (ASIC-on-memory-module):

http://www.theregister.co.uk/2009/11/11/netlist_hypercloud_memory/
Netlist goes virtual and dense with server memory
So much for that Cisco UCS memory advantage
By Timothy Prickett Morgan
Posted in Servers, 11th November 2009 18:01 GMT

HyperCloud latency trumps LRDIMMs

Netlist (NLST) HyperCloud has similar latency to RDIMMs (a huge advantage) and a rather significant “4 clock latency improvement” over the LRDIMM (quote from the Netlist Craig-Hallum conference – October 6, 2011):

http://www.netlist.com/investors/investors.html
Craig-Hallum 2nd Annual Alpha Select Conference
Thursday, October 6th at 10:40 am ET

http://wsw.com/webcast/ch/nlst/

Question:

at the 23:35 minute mark:

(unintelligible)

Chris Lopes:

Inphi (IPHI). Good question. How is HyperCloud different from what IPHI is offering.

IPHI is a chip company – so they build a register.

The register is then sold to a memory company.

And the memory company builds a sub-system with that.

And that’s the module they are calling an LRDIMM or Load-Reduced DIMM.

The difference is that the chip is one very large chip, whereas we have a distributed buffer architecture, so we have 9 buffers and one register.

Our register fits in the same normal footprint of a standard register, so no architectural changes are needed there.

at the 24:35 minute mark:

And our distributed buffers allow for a 4 clock latency improvement over the LRDIMM.

So the LRDIMM doubles the memory. HyperCloud doubles the memory.

LRDIMM slows down .. the bus. HyperCloud speeds up the bus.

So you get ours plugged in without any special BIOS requirement.

So it plugs into a Westmere, plugs into a Romley, operates just like a register DIMM which is a standard memory interface that everyone of the server OEMs is using.

The LRDIMM requires a special BIOS, special software firmware from the processor company to interface to it.

And it’s slower.

Does that answer your question ?
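As a rough cross-check against the 5 ns figure quoted earlier, the “4 clock latency improvement” can be converted into nanoseconds. This is a sketch under my own assumption that “clock” here means the DDR3 I/O clock (half the MT/s data rate); the transcript does not specify which clock is meant.

```python
# Rough cross-check: what a "4 clock latency improvement" amounts to in ns,
# assuming "clock" means the DDR3 I/O clock (half the MT/s data rate).
for data_rate_mts in (1066, 1333):
    period_ns = 2000.0 / data_rate_mts   # 1000 / (data_rate / 2)
    print(f"DDR3-{data_rate_mts}: 4 clocks ~ {4 * period_ns:.1f} ns")
```

Under that assumption, 4 clocks works out to roughly 7.5 ns at 1066 MT/s and 6.0 ns at 1333 MT/s – in the same ballpark as the “5 ns latency penalty” quoted by the Inphi blog, which is consistent with HyperCloud having roughly RDIMM-like latency while the LRDIMM carries the buffering penalty.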

As a result of this latency issue, 32GB LRDIMMs underperform the 32GB HyperCloud (see the CMTL benchmarks for LRDIMM vs. HyperCloud):

http://www.netlist.com/products/hypercloud/whitepapers/hcdimm_vs_lrdimm_whitepaper_march_2012.pdf
HyperCloud HCDIMM Outperforms LRDIMM in ‘Big Data’ & ‘Big Memory’ Applications
Whitepaper
March 2012

In addition, 32GB LRDIMMs cannot deliver 3 DPC at 1333MHz (perhaps because they use 4Gbit x 2 dual-die package (DDP) memory, which may hinder load-reduction efforts).

As a result, not only does the HyperCloud memory have lower latency (i.e. better), it also delivers 1333MHz at 3 DPC (which LRDIMMs cannot).

DDR4 latency

DDR4 has dropped some of the asymmetrical-line and centralized-buffer-chipset choices made in the LRDIMMs, in favor of following even more closely the symmetrical lines and decentralized buffer chipset of the Netlist HyperCloud:

http://www.theregister.co.uk/2011/11/30/netlist_32gb_hypercloud_memory/
Netlist puffs HyperCloud DDR3 memory to 32GB
DDR4 spec copies homework
By Timothy Prickett Morgan
Posted in Servers, 30th November 2011 20:51 GMT

Here is a CMTL labs comparison of HCDIMM and LRDIMM that goes into more detail:

CMTL HCDIMM Outperforms LRDIMM in “Big Data” & “Big Memory” Applications White Paper
http://www.netlist.com/products/hypercloud/whitepapers/hcdimm_vs_lrdimm_whitepaper_march_2012.pdf

UPDATE: 06/24/2012 – Invensas on LRDIMM design inferiority vs. HyperCloud

Here is a paper that describes a future DDP (dual-die packaging) design. One of the authors – Bill Gervasi – is a former Netlist employee and a former JEDEC committee chair. The paper discusses its applicability to both LRDIMMs and HyperCloud. Here is the section where the authors compare the design weaknesses of the LRDIMM with the superiority of the HyperCloud design:

http://www.invensas.com/Company/Documents/Invensas_ISQED2012_CostMinimizedDoubleDieDRAMUltraHighPerformanceDDR3DDR4MultiRankServerDIMMs.pdf
Cost-minimized Double Die DRAM Packaging for Ultra-High Performance DDR3 and DDR4 Multi-Rank Server DIMMs
Richard Crisp 1 , Bill Gervasi 2 , Wael Zohni 1 , Bel Haba 1
1 Invensas Corp, 2702 Orchard Parkway, San Jose, CA USA
2 Discobolus Designs, 22 Foliate Way, Ladera Ranch, CA USA

pg. 3:
5. Applicability to LRDIMM and Hypercloud DIMMs

The LRDIMM differs from the RDIMM in that the DQ and DQ Strobe signals are buffered[1]. The data buffer is placed in the central region of the LRDIMM. This requires all data and data strobes to be routed from each DRAM package to the buffer and then routed back to the edge connector which demands additional routing layers versus an RDIMM. Since the LRDIMM is plugged into an edge connector, the thickness of the DIMM PCB is fixed.

Adding PCB layers necessarily requires a reduction of the thickness of the dielectric layers separating the power planes and routing layers. Unless the width of the traces is made narrower, the characteristic impedance of the etched traces is decreased and can lead to signal reflections arising from impedance discontinuities that diminish voltage and timing margin.

Trace width is limited by the precision of the control of the etching process, with such narrower traces being more costly to manufacture within tolerance. Because the DFD’s C/A bus routes on a single layer and other interconnections lay out cleanly, the layer count is reduced leading to nominal impedances being attainable with normal dimensional control keeping raw card costs from rising.

The Hypercloud architecture is similar to the LRDIMM in that the DQ and DQ Strobe signals are buffered, but unlike the LRDIMM the buffering is provided by a number of data buffer devices placed between the edge connector and the DRAM package array on the DIMM PCB[2]. The 11.5 x 11.5 mm package outline of the DFD supports placement of the buffers without requiring growth of the vertical height of the DIMM. In fact a simple modification of the RDIMM PCB will enable the Hypercloud data buffers to be mounted on the PCB making conversion of an RDIMM design to Hypercloud a straightforward matter.

Here Invensas is saying that with the LRDIMM's centralized buffer chipset, a greater number of lines must be routed back and forth between the DRAM packages and the central buffer. This forces the use of a greater number of PCB layers, each of which then needs to be thinner, and that leads to signal-quality issues.
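A rough way to see the impedance effect described above is the common IPC-2141 surface-microstrip approximation for trace impedance. The sketch below uses made-up illustrative dimensions (not taken from the Invensas paper) simply to show that thinning the dielectric at a fixed trace width pulls the characteristic impedance down, and that restoring it requires narrower (harder-to-etch) traces.

```python
import math

def microstrip_z0(er, h_mil, w_mil, t_mil=1.4):
    """IPC-2141 surface-microstrip approximation of characteristic impedance.
    er: dielectric constant, h: dielectric height, w: trace width,
    t: trace thickness (dimensions in mils)."""
    return (87.0 / math.sqrt(er + 1.41)) * math.log(5.98 * h_mil / (0.8 * w_mil + t_mil))

# Hypothetical FR-4 example (er ~ 4.3), purely illustrative:
print(microstrip_z0(4.3, h_mil=5.0, w_mil=5.0))  # ~62 ohms nominal
print(microstrip_z0(4.3, h_mil=2.5, w_mil=5.0))  # thinner dielectric -> ~37 ohms
print(microstrip_z0(4.3, h_mil=2.5, w_mil=2.0))  # narrower trace -> back to ~58 ohms
```

This is only an approximation (and inner-layer stripline traces behave differently), but it captures the tradeoff Invensas describes: more, thinner layers force either an impedance drop or tighter trace-width tolerances.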

For these reasons, DDR4 (which adopts the distributed-buffer approach) should have much better latency than one would expect from the LRDIMM design alone.

UPDATE: 07/27/2012 – confirmed HCDIMM similar latency as RDIMMs
UPDATE: 07/27/2012 – confirmed LRDIMM latency and throughput weakness

It has been confirmed that HyperCloud HCDIMM latency is similar to 16GB RDIMM (2-rank) latency. The LRDIMM latency and throughput weakness vs. HCDIMM – even when both are run at the SAME lowered speeds of the LRDIMM – has also been confirmed:

https://ddr3memory.wordpress.com/2012/07/26/latency-and-throughput-figures-for-lrdimms-emerge/
Latency and throughput figures for LRDIMMs emerge
July 26, 2012
