Memory choices for the IBM System X x3750 M4 servers

IBM improves the memory bus ?

UPDATE: added 06/05/2012 – possible reason for speedup
UPDATE: added 06/19/2012 – draft IBM Redpaper
UPDATE: added 06/22/2012 – IBM feedback on speedup
UPDATE: added 07/04/2012 – “above 384GB requires HyperCloud”
UPDATE: added 07/26/2012 – LRDIMM underperforms HCDIMM even at same speeds

These are 4-socket servers (4 processors) with 12 DIMMs per processor for a total of:

4 processors x 12 DIMMs per processor = 48 DIMM slots
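As a quick sanity check, the slot count and the resulting maximum capacity (assuming 32GB modules, the largest DDR3 size discussed below) can be sketched as:

```python
# Total DIMM slots and maximum memory for the x3750 M4.
sockets = 4
dimms_per_socket = 12
slots = sockets * dimms_per_socket   # 4 x 12 = 48 slots

max_module_gb = 32                   # largest DDR3 module size discussed here
print(slots)                         # 48
print(slots * max_module_gb)         # 1536 GB (1.5TB) fully populated
```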

This server has the same disk storage capacity as the x3650 M4 servers, but 2x the CPU and memory (and so is perhaps suited to more compute-intensive tasks than the x3650 M4).
IBM System x3750 M4
IBM System x3750 M4 Implementation Guide
A draft IBM Redpaper publication

Improved speeds

What is astounding about the specs given for this server (assuming they are not a typo, and are not hiding some other deficiency – for example much higher latency) is that they seem to allow the same memory modules to run at a faster speed grade – i.e. both RDIMMs and LRDIMMs run faster.

This is a similar story to what IBM/HP have delivered for their other Romley servers, where they promise “better than Intel spec” memory speeds – but here IBM takes it up a notch.

And they do this for the standard RDIMMs and LRDIMMs they are selling – which suggests that HCDIMMs (though not qualified for this particular server), if available, would probably also benefit by a speed grade (if the improvements really are to the memory bus).

Why only IBM/HP

What is interesting is that so far only IBM/HP have promised these “better than Intel spec” speeds.

What sets IBM/HP apart from the other server makers is that both have partnerships with Netlist for HyperCloud (on some of their high-volume data center servers).

This raises the possibility that Netlist’s involvement in qualifying HyperCloud with IBM/HP led to feedback from NLST engineers that improved the IBM/HP memory bus capability (pure speculation).

None of the other players – Dell, SuperMicro etc. – have mentioned any of this for Romley.

It is unlikely that HyperCloud itself would have contributed – since it is a self-contained memory module. This seems instead to be an improvement in the motherboard circuitry.

Since we don’t have an explanation from IBM/HP on how they are able to deliver these improvements, it is unclear what side effects, if any, these improvements bring with them (for example – is IBM using an ASIC-on-motherboard approach like Cisco UCS used with their “Catalina” ASIC and has since abandoned?). Or is there a latency penalty incurred by such a solution? However an ASIC-on-motherboard solution seems remote, since in that case you would not need an LRDIMM/HyperCloud solution for load reduction (as Cisco UCS did not – until they moved away from that approach in recent servers).

Why are LRDIMMs still listed – the importance of load reduction

The speed improvements on this server seem to let RDIMMs run at the same speed as LRDIMMs – and both run at a higher speed than usual.

Which raises the question – why even use the LRDIMM then?

The reason is load reduction.

A “load reduction” solution is needed at the 32GB memory module size, because 32GB RDIMMs will only be available at 4-rank (2-rank would require 8Gbit DRAM dies, which will not be available for a few years, if ever – see other articles here). And 4-rank RDIMMs experience abysmal speed slowdowns not just at 3 DPC, but also at 2 DPC and possibly even at 1 DPC (see other articles here).
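The rank arithmetic behind this can be sketched with a simplified model (a standard DDR3 rank is 64 data bits wide, so with x4 DRAM devices a rank holds 16 devices; ECC devices are ignored here for simplicity):

```python
# Why a 32GB DDR3 RDIMM built from 4Gbit dies must be 4-rank.
RANK_WIDTH_BITS = 64      # data width of one DDR3 rank (ECC devices ignored)
DEVICE_WIDTH_BITS = 4     # x4 DRAM devices
devices_per_rank = RANK_WIDTH_BITS // DEVICE_WIDTH_BITS   # 16 devices

def ranks_needed(module_gb, die_gbit):
    """Ranks required for a module_gb module built from die_gbit dies."""
    rank_capacity_gb = devices_per_rank * die_gbit / 8    # Gbit -> GB
    return module_gb / rank_capacity_gb

print(ranks_needed(32, 4))   # 4.0 -> 4-rank with today's 4Gbit dies
print(ranks_needed(32, 8))   # 2.0 -> 2-rank would need 8Gbit dies
```

With 4Gbit dies a rank holds 8GB, so a 32GB module needs 4 ranks – hence the need for load reduction (or DDP stacking, at a price premium).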

However, load reduction and rank multiplication are Netlist IP.

IP issues for LRDIMM and DDR4 – users of load reduction

Inphi – currently the only maker of LRDIMM buffer chipsets, as others have backed off – lost a challenge of Netlist IP at the USPTO. As a result the Netlist patents have emerged stronger, and are going to come back and bite Inphi in Netlist vs. Inphi, which was stayed pending these patent reexaminations. Claims which survive reexamination are much harder to challenge again in court – NLST patents ‘537 and ‘274 survived with ALL claims intact, which is a powerful statement on the strength of their IP (and the frivolity of the Inphi challenge).

DDR4 has dropped the asymmetrical lines and centralized buffer chipset used in LRDIMMs (the cause of the high latency issues in LRDIMMs), in favor of an even closer following of the symmetrical lines and decentralized buffer chipset of the Netlist HyperCloud (see other articles here).

So in summary, load reduction will be an essential technology at 32GB and beyond, through to DDR4. DDR4 will require licensing from Netlist (which might cover LRDIMMs along the way), since at DDR4’s lower voltages and higher frequencies the need for load reduction becomes even more important.

32GB load reduction solutions – 32GB LRDIMMs vs. 32GB HCDIMMs

This is covered in more detail in other articles here.

Here is a comparison of 32GB LRDIMMs vs. 32GB HyperCloud (IBM 32GB HCDIMM and HP 32GB HP Smart Memory HyperCloud – available mid-2012):

– 32GB LRDIMMs are slower than 32GB HyperCloud when both are run on the same machine
– 32GB LRDIMMs have higher latency than 32GB HyperCloud.
– 32GB LRDIMMs have legal risk associated with them (of recall or cancellation)
– 32GB LRDIMMs are expensive (much more than 2x the price of 16GB LRDIMMs)
– 32GB HyperCloud will also be cheaper than 32GB LRDIMMs – because the 32GB LRDIMMs use 4Gbit x 2 (DDP) packages, while the NLST 32GB HyperCloud uses 4Gbit (monolithic) packages, leveraging Netlist’s Planar-X IP

UPDATE: added 06/05/2012
Possible reason for speedup

It seems this speedup might be related to what other manufacturers (like SuperMicro) call “Forced SPD” in the BIOS settings. This “may” allow the RDIMMs to run at 1333MHz at 3 DPC. However it is considered outside the Intel PoR (“plan of record”) and is thus “use at your own risk”. And many may not want to run the system with this setting on. Perhaps IBM is supporting this for HPC applications.

However, all is not well when you run LRDIMMs at 1333MHz. Since the latency issues remain, the fixed time-delta latency hit that LRDIMMs experience at 1066MHz becomes a LARGER hit in terms of clock cycles at 1333MHz (at a higher clock frequency the clock period is shorter, so more clock cycles fit within the same latency time delta).

When the same LRDIMMs are run at 1333MHz at 3 DPC (with the help of a BIOS tweak as on the IBM System X x3750 M4 servers), the latency in terms of clock cycles may look even worse when compared vs. RDIMMs or HyperCloud.
LRDIMM latency vs. DDR4
May 31, 2012
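The cycle-count effect described above can be illustrated with a toy calculation (the 10 ns latency delta is purely hypothetical, for illustration only; DDR3-1066 and DDR3-1333 have bus clocks of 533MHz and 666MHz respectively):

```python
# A fixed latency delta (in ns) costs more clock cycles at a higher clock rate.
def cycles_for_delta(delta_ns, clock_mhz):
    period_ns = 1000.0 / clock_mhz     # one clock period in nanoseconds
    return delta_ns / period_ns

DELTA_NS = 10.0                        # hypothetical LRDIMM latency penalty
print(cycles_for_delta(DELTA_NS, 533)) # ~5.3 cycles at DDR3-1066 (533MHz clock)
print(cycles_for_delta(DELTA_NS, 666)) # ~6.7 cycles at DDR3-1333 (666MHz clock)
```

The same nanosecond penalty translates into more lost cycles at the higher speed – which is why the LRDIMM latency gap looks worse, not better, when the bus is sped up.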

UPDATE: added 06/22/2012 – IBM feedback on speedup

We have feedback from IBM (see comments below).

IBM suggests they were able to achieve the speedup by improvements on the motherboard – essentially confirming that the improvements were on the “memory bus” side as crudely described above. The improved signal quality is “the tide which lifts all boats” – as a result both RDIMMs and LRDIMMs (the only two memory types available – as IBM HCDIMM is not available (yet) on this server) experience similar speedup.

For IBM HCDIMMs to benefit from this leeway, one could conjecture that a speedup to 1600MHz may be possible (since HCDIMMs already deliver 3 DPC at 1333MHz on regular Romley servers – like the IBM x3650 M4 – i.e. faster than LRDIMMs and RDIMMs at 3 DPC). However, Netlist would have to produce 1600MHz DRAM-based IBM HCDIMMs for that to be usable (which may not be available right now).

However at the 32GB level, the 32GB RDIMMs will perform poorly (being 4-rank), and the choices left will be 32GB LRDIMMs vs. 32GB HCDIMMs (which should be available mid-2012 as stated by Netlist in prior CC).

At that point the 32GB HCDIMMs should outperform the 32GB LRDIMMs – since the differences are not just speed, but also the high latency issues of the LRDIMMs.

LRDIMMs DO have one advantage – they are produced by a number of memory module makers (who use the LRDIMM buffer chipsets from Inphi). For this reason, when you examine the whole universe of LRDIMM solutions there WILL be considerably more options, purely because there are more players using the same buffers in different ways.

For this reason, Netlist may have chosen only a couple of servers (the HP DL360p and DL380p and the IBM System X x3650 M4 server) to target initially – chosen for high volume use and with fewer varieties of memory module to support.

Same RDIMMs, LRDIMMs perform better

Coming back to the performance of RDIMMs and LRDIMMs on the IBM x3750 M4 – it is the SAME RDIMMs and LRDIMMs that are performing better (than on the regular Romley servers like the IBM x3650 M4).

IBM’s explanation seems to be along the same lines as explained here for how LRDIMMs differ from DDR4.

On the LRDIMMs there are asymmetrical line lengths and centralized buffer chipset – on the DDR4 (which is copying HyperCloud even more) the lines are symmetrical and there is a decentralized buffer chipset. The symmetrical lines reduce signal “skew” and allow better tuning.

On the IBM x3750 M4 they have tried to do something similar for the shortcomings on the motherboard side:

– shorten line lengths from processor to DIMM slots
– make line lengths symmetrical from processor to the individual DIMM slots
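The effect of trace-length matching on skew can be sketched with rough numbers (the 25mm mismatch is hypothetical; ~6.6 ps/mm is a typical propagation delay for an FR4 stripline trace – an assumed figure, not an IBM number):

```python
# Signal skew from mismatched trace lengths vs. the DDR3 clock period.
PROP_DELAY_PS_PER_MM = 6.6           # assumed FR4 stripline propagation delay
mismatch_mm = 25.0                   # hypothetical length mismatch between slots
skew_ps = mismatch_mm * PROP_DELAY_PS_PER_MM   # ~165 ps of skew

clock_mhz = 666                      # DDR3-1333 bus clock
period_ps = 1_000_000 / clock_mhz    # ~1502 ps clock period
print(skew_ps, skew_ps / period_ps)  # skew eats roughly 11% of the timing budget
```

Shortening and equalizing the traces shrinks this skew term, freeing up timing margin – which is consistent with the same DIMMs running a speed grade faster on this board.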

One would expect that if IBM HCDIMMs become available for the IBM x3750 M4, they too would demonstrate better-than-usual performance – following the same “rising tide lifts all boats” behavior seen with the same RDIMMs and LRDIMMs performing better on the IBM x3750 M4 than on “regular” Romley servers like the IBM x3650 M4.

UPDATE: added 07/04/2012 – “above 384GB requires HyperCloud”

Memory choices for the IBM x3750 M4 server

Even if HyperCloud does not benefit and exhibit a speed grade advantage above LRDIMMs, it still remains preferable to the LRDIMMs (which have performance, latency, price and IP issues).

At the 16GB memory module level, even if HyperCloud has greater leeway than the LRDIMM on the IBM x3750 M4 servers (thanks to the non-Intel PoR memory bus/motherboard tweak), that will only be demonstrated if HyperCloud becomes available in 1600MHz varieties – however HyperCloud will still retain a latency advantage over LRDIMMs.

However, at the 32GB memory module level, HyperCloud outperforms RDIMMs and LRDIMMs above 384GB (on a 2-socket server) – on performance, latency, price and IP issues:
Infographic – memory buying guide for Romley 2-socket servers
June 29, 2012

32GB HyperCloud will be made using 4Gbit monolithic memory packages (leveraging the Netlist Planar-X IP), compared to the 32GB RDIMMs and 32GB LRDIMMs, which are produced using the more expensive 4Gbit x 2 DDP memory packages. So 32GB HyperCloud will have price superiority over RDIMM and LRDIMM as well.
Memory buying guide – including 1.35V memory for Romley
June 28, 2012

Thus for all non-Intel PoR servers (like the IBM x3750 M4 server – which have the memory bus/motherboard tweak) and the higher end servers from IBM, the memory choice “rule of thumb” becomes:

– “above 384GB requires HyperCloud”
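The 384GB figure follows directly from the slot count of a 2-socket Romley server (a sketch, assuming 12 DIMMs per socket as on these IBM servers):

```python
# Why 384GB is the crossover point on a 2-socket Romley server.
sockets = 2
dimms_per_socket = 12
slots = sockets * dimms_per_socket   # 24 DIMM slots

max_with_16gb = slots * 16           # 384GB - the limit with 16GB RDIMMs
print(max_with_16gb)
print(slots * 32)                    # 768GB - requires 32GB modules
```

Beyond 384GB every slot must hold a 32GB module, and since 32GB RDIMMs are 4-rank (and slow), a load reduction solution becomes mandatory at that point.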

UPDATE: added 07/26/2012 – LRDIMM underperforms HCDIMM even at same speeds

Significance of comparing LRDIMMs and HCDIMMs running at 1066MHz at 3 DPC

The IBM x3750 M4 server goes beyond the Intel POR (plan-of-record) and implements some motherboard tweaks that “lifts all boats” i.e. allows RDIMMs and LRDIMMs to run faster.

Presumably if HCDIMMs become available on this server, they too would benefit from this leeway (possibly running faster than 1333MHz ?).

In the text accompanying Figure 4 in their blog, Netlist mentions 1600MHz HyperCloud HCDIMM availability in Q1 2013:
HyperCloud HCDIMM: Scaling the High Density Memory Cliff
July 24, 2012 at 5:00 AM

In Figure 6 in the above blog – Netlist compares:

– 32GB LRDIMMs running at their max achievable 1066MHz at 3 DPC
– 32GB HyperCloud HCDIMMs running at a SLOWED-DOWN 1066MHz at 3 DPC

HCDIMMs can actually run faster – at 1333MHz at 3 DPC – but are slowed down for the purposes of the comparison with LRDIMMs.

One can see from the results (the LRDIMM results are taken from HP docs) that even with the deliberately slowed down HCDIMM, the LRDIMM performs poorly.

What Netlist is trying to demonstrate is the architectural weakness of the LRDIMM (running both at the same speed isolates that weakness).

For more on the architectural weaknesses in the LRDIMM design (asymmetrical lines and centralized buffer chipset):
LRDIMM latency vs. DDR4
May 31, 2012

For a detailed examination of the LRDIMM “loaded latency” and “throughput” figures (pointed out by the Netlist blog above), checkout:
Latency and throughput figures for LRDIMMs emerge
July 26, 2012

The comparison between LRDIMMs and HCDIMMs running at the same 1066MHz at 3 DPC suggests that even if LRDIMMs were to achieve an “improved” speed of 1333MHz at 3 DPC on a non-Intel-PoR server such as the IBM x3750 M4, there may still be significant latency and throughput disadvantages inherent in the LRDIMM design, which render them inferior to RDIMM-compatible options like the HyperCloud HCDIMMs.

This has been demonstrated above by comparing LRDIMMs at 1066MHz at 3 DPC vs. HCDIMMs SLOWED down to 1066MHz at 3 DPC (since HCDIMMs can run faster at 1333MHz at 3 DPC).

This creates a very difficult use-case for LRDIMMs vs. the RDIMM-compatible HyperCloud HCDIMMs:

– LRDIMMs are not compatible with DDR3 RDIMMs (they are a new standard)
– LRDIMMs have worse latency at 3 DPC vs. HCDIMMs
– LRDIMMs have worse throughput at 3 DPC vs. HCDIMMs




6 responses to “Memory choices for the IBM System X x3750 M4 servers”

  1. I’m a project leader at IBM Redbooks and the lead author of the paper on the x3750 M4 that you referenced, “IBM System x3750 M4 Implementation Guide”. I chatted with our development team about how they were able to achieve faster clock speeds than the Intel POR (plan of record). I’m pleased to report that it is simply the result of careful design and thorough testing.

    Each x3750 M4 memory bus was carefully designed on the system board and expansion tray to be shorter than that required by the Intel specification. The result is a memory bus with less signal loss as the memory data makes its way from the processors to the DIMMs. The shorter bus with less loss allows for faster speeds. So instead of running 3 DPC at 1066 MHz (for example) we were able to operate them at 1333 MHz. The key to success was to make sure that the processors are electrically central to all memory DIMMs.

    This design is then backed up with very thorough testing. Our test teams thoroughly test each and every DIMM (and DIMM vendor) we support in the server. Each DIMM is taken through margin analysis in which we vary voltages, timings and temperatures of the memory bus to ensure that we still meet Intel’s specification.

    I hope that answers your questions.

    David Watts
    IBM Redbooks

    • Dave,

      So to summarize as I understand it – you were able to:

      – shorten line lengths from processor to DIMM slots
      – make line lengths symmetrical from processor to the individual DIMM slots


