Cache cache cache, turns into cash cash cash…
At least that’s the general theory, and in most instances it seems to hold. For cryptonight coins, 2MB of fast memory is needed per thread. Depending on the CPU architecture, that could be L2, L3 or even L4 cache. In my short journey into the world of crypto mining, I haven’t found this to always be the case. While I don’t have any screenshots, and the numbers I will talk about come from my beleaguered memory, they are close enough to get the point across.
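The 2MB-per-thread rule means your usable thread count is roughly cache size divided by scratchpad size, capped by core count. A minimal sketch (the helper name is mine, the cache and core figures are the chips discussed below):

```python
# Rough rule of thumb for cryptonight: each mining thread needs a 2 MB
# scratchpad that should fit in fast cache (L3 on these Xeons).
SCRATCHPAD_MB = 2

def max_threads(l3_cache_mb, hw_threads):
    """Threads are capped by whichever runs out first: cache or cores."""
    cache_limited = l3_cache_mb // SCRATCHPAD_MB
    return min(cache_limited, hw_threads)

# Per-socket figures for the chips in this post:
print(max_threads(24, 8))    # E7-8837: 24 MB L3, 8 threads  -> 8 (core-limited)
print(max_threads(30, 20))   # E7-4870: 30 MB L3, 20 threads -> 15 (cache-limited)
```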
Where it all started
The first system I got that was dedicated to mining was an HP ProLiant DL580 G7 with X7542 processors. I bought it because others said they were mining with it, and since I like to be a dense blockhead, I figured dual- and quad-CPU systems were the way to go if I could GPU mine with them as well. I didn’t realize then that AES instructions weren’t in every Xeon chip, so I had no idea that the server I just bought had garbage performance when it came to CPU mining. This thing would JUST touch 400H/s at like 580W on the X7542s… LOL. So after some more reading, and seeing a post by someone else, I ended up with a set of four E7-8837 processors for $35 shipped. While not hyper-threaded, these 8-core CPUs have 24MB of L3 cache and will actually yield 1650-1700H/s at roughly the same power consumption as the X7542s. For $35? Hell yeah!
More is better right?
So everyone on the Interwebs, forums, and these newfangled things called Reddit and Discord (can we just go back to AIM and mIRC?) states that the optimum thread count is basically fast on-die memory divided by 2MB. That should mean the more L2/L3 cache, the better, right? Well, maybe not so much. I spent a lot of time poking around Intel’s Ark site to figure out what would work with the bulk purchase of server equipment that I picked up at government auction. For the HP DL580 G7 boxes, the king of the heap for socket LGA 1567 appears to be the E7-4870. This 2.4GHz 10-core monster packs a whopping 30MB of L3 cache, and it is hyper-threaded. When you compare it to the E7-8837s, which are 2.6GHz, 8-core and 24MB of L3 cache, you would expect it to put up numbers 20-25% higher. You know that trumpet sound that plays when someone fails? That’s what the E7-4870 has gotten in every single test I have tried.
How it went down
I started scouring my favorite place to cop tech deals, good ol’ faithful eBay. I missed out on a couple of auctions for this mystical beast of a processor, which in the long run was probably for the best. I ended up finding a guy trying to sell a big lot of CPUs, including 12 of the 4870s. I negotiated to take them all for $420 shipped, since by this point I already had two DL580 G7s running with plans for more. When they finally came in I was oh so super excited. I rushed to power down the one box (which also had four GPUs in it by now) so that I could pull the processor/memory cassette and swap in this 30MB goodness. I mean, with 30MB of L3 cache and four processors, this should be 60 threads at like 35-38H/s, which would have totaled 2100 to 2300H/s! Yes, it sucks down more juice than a Vega Frontier, but it would cost me less than $500 per box and I’d have a bunch of extra stuff like PSUs for GPUs that I wouldn’t have to buy.
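The back-of-the-envelope math behind that 60-thread figure is the same cache rule again, times four sockets (the 35-38H/s per-thread range is the estimate from above, not a measurement):

```python
# Four E7-4870s: threads are cache-limited to 30 MB / 2 MB = 15 per socket.
sockets = 4
threads_per_socket = 30 // 2                  # 15, cache-limited
total_threads = sockets * threads_per_socket  # 60

low, high = 35, 38                            # expected H/s per thread
print(total_threads)                          # 60
print(total_threads * low, total_threads * high)  # 2100 2280
```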
In five minutes I had the cassette out, the heat sinks off, the new chips in and greased up, and I was pushing the power button. Now, if you have never used any of the HP server iron like the ProLiant DL580 G7s, you wouldn’t know that these things take FOREVER to boot. Before the system BIOS ever comes up, a whole subset of routines fires up for remote management and control. Well, after 15 minutes, I finally gave up hope that the server was going to come up. I tried the memory swap game, the take-out-some-CPUs game, reset the NVRAM, you know, all the regular troubleshooting things you would do when a box won’t boot. Finally, I decided that the box probably needed a firmware update, so I threw the 8837s back in because, mind you, 2 hours had now passed and 4 GPUs hadn’t been working.
WTF with a side of greedy corporate paywalls
Having been around computers for the majority of my life, I’m pretty familiar with the concept and practice of updating firmware and BIOSes for a variety of devices. Never have I seen what I encountered when I simply wanted to take hardware that I purchased and update the system so I could use newer stuff. HP, in its quest to maximize its bottom line, has decided that certain updates are pay-for only. Yes, we built this. Yes, we sold it and made our money on it. The firmware was developed and paid for by existing support contracts, but let’s make some extra money by only allowing you to get it if you have a support contract. That’s a big load of bullshit in my opinion. As a businessman I understand the concept, but I feel it is purely a cash grab and an attempt to protect their service contract business. Anyway, if you seek you shall find, and find I did. 12 hours later, I was finally able to get the box going with the 4870s.
Talk about blue balls
Have you ever brought home a really hot chick? One you had to work really hard to land? How would you feel if, after all the effort and anticipation, she just passed out on your bed and didn’t perform? That’s how I feel about the 4870s. I paid $35 each for these 4870s and they were doing 1450H/s. I played around with so many different thread settings. I tried different memory configurations. I tried different BIOS settings and even tried disabling the hyper-threaded cores. My rationale was: shit, with hyper-threading turned off, this is just an 8837 with 2 more cores and 6MB more L3 cache. You would think the thread config would be similar in this scenario: just add a few more threads, match thread count to available cores, and adjust memory size per thread to eat up all the L3. Nope, nope and nope. No matter what I did, I just couldn’t get this series of self-discovered wunder chips to perform anywhere near their paper potential. Processors that cost me the same for a set of four as I paid for one of these chips were obliterating the 4870s in the cost/performance ratio.
Now that I’ve deployed a decent number of boxes, management is becoming an issue. It’s no longer a simple task when I have to replicate changes across 20 workers manually. I saw somewhere that someone mentioned a fork of XMRig that included command and control features that appeared to be exactly what I needed to drastically reduce redundant actions. When I started digging into XMRigCC, I noticed that in the benchmarks, they listed the E7-4820 as doing 2200H/s in a quad config. WHAT? HAS ALL MY RESEARCH BEEN DEBUNKED AND I’M JUST AN IDIOT WHO CAN’T FIGURE SHIT OUT? Oh, and did I mention I had already started selling the 4870s in sets of four on eBay and was on my last set? So I snagged XMRigCC and installed the last set of 4870s I had for some final testing. I only had a few days left because someone had already put in a winning bid on the auction for those processors.
I first started my testing under Windows. XMRig wasn’t detecting all four CPUs, but it was using the proper number of threads. It started to look promising, but again I ran into the same limit of just under 1500H/s. Since some people say there are performance gains to be had under Linux, I loaded Ubuntu 16.04 LTS and tried again. I was still getting the same results. I even wrapped my head around the binary-to-hex conversions for proper CPU affinity and masking, but to no avail. I then reached out to Bendr0id, who coded the XMRigCC fork and who spent a few hours with me trying to diagnose the issue.
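That binary-to-hex conversion is less mysterious than it looks: an affinity mask is just a bitmap where bit N means the thread may run on CPU N. A quick sketch (the helper name is mine):

```python
# CPU affinity masks are bitmaps: bit N set = thread may run on CPU N.
# Miners typically take the mask as hex, which is easiest to build in
# binary and then convert.
def affinity_mask(cpus):
    """Build a hex affinity mask string from a list of CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return hex(mask)

print(affinity_mask(range(8)))      # CPUs 0-7  -> 0xff
print(affinity_mask(range(8, 16)))  # CPUs 8-15 -> 0xff00
```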
I learned a few new Linux tools like htop and hwloc’s lstopo (which is totally cool under a GUI, but not so much from the command line :D), and what we discovered just didn’t make any sense. As I thought, the 8837 and 4870 are derived from the same architecture with very similar L1/L2/L3 cache structures. However, all similarities end there, because there is some kind of bottleneck we just couldn’t nail down with the 4870 chips. In many ways, it looks like, because of NUMA, the processors are attempting to use the L3 cache of other processors over the QPI interconnect, slowing things down tremendously. On a side note, XMR-Stak has consistently beaten XMRig in my testing for CPU mining. I don’t mine cryptonight coins on my Nvidia cards, so I can’t comment on GPU performance.
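One way to attack that cross-node L3 traffic is to pin each group of miner threads to a single NUMA node, so scratchpads stay in local cache instead of bouncing over QPI. A sketch under an assumed layout (the function name and the linear CPU numbering are hypothetical; check your real topology with lstopo):

```python
# Build one affinity mask per NUMA node so each miner thread group stays
# on one socket's L3. The node->CPU layout here is an assumption.
def node_masks(nodes):
    """Return one hex affinity mask per NUMA node's CPU list."""
    masks = []
    for cpus in nodes:
        mask = 0
        for cpu in cpus:
            mask |= 1 << cpu
        masks.append(hex(mask))
    return masks

# Example: four sockets, 8 CPUs each, numbered linearly per socket.
layout = [list(range(n * 8, n * 8 + 8)) for n in range(4)]
print(node_masks(layout))  # ['0xff', '0xff00', '0xff0000', '0xff000000']
```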
The auction came to a close and I boxed up the last of my 4870 CPUs. I was glad I hadn’t lost money, as I was able to turn a slight profit selling the 4870s in sets of four, but I was disappointed that what should have been freaking awesome on paper sucked in reality. I was really hoping for a Dell PowerEdge R815 killer in both performance and price, but alas, it was not to be. I still really dig the DL580s, though, because even though I’m limited to 1650H/s on the CPUs so far, they are a steal of a deal when you consider they can drive 11 GPUs with an add-on PCIe expansion board. Not only that, but they normally come with four 1200W PSUs, of which one will run the box and four 1060s, leaving you three for more GPUs when coupled with breakout boards! The icing on the cake has been the average $300 cost + $40 CPU upgrade that I’ve been paying, and they sure do ROI faster than the R815s!
If you have experienced something different with the Westmere EX Xeon series chips with 10 cores, please, let me know!
2 thoughts on “They say L3 cache is king… Is it?”
This is really informative. Thank you. I can confirm, there is something fishy about the E7-4870 processors. I thought for sure I was doing something wrong when I noticed that using the L3 cache on one CPU affected the processing on a SEPARATE CPU. I do not yet know how / why / what about the QPI interconnect, but I will.
It took me forever to figure out how to configure memory on these large 4x CPU machines too. Between the memory channels, the ranks, and how it all can be configured, it’s like a 30-page book for the Dell R810 machines. It is going to take me a while to study and understand how it works. I believe the BIOS and RAM configurations can affect how the system handles that CPU cache. I believe it was designed to use as much of it as needed, automatically.
I have definitely seen RAM population strategies affect speed AND power consumption on the Dell R815 Opteron systems I have. You have to populate both data channels for each processor with two DIMMs, for 4 per CPU and 16 per box. There is even a difference between using single- and dual-ranked DIMMs. With the Xeons in my DL580s, I’ve pretty much been able to strip them down to just two sticks per memory cassette without affecting overall speed much, but saving a decent amount of power.
I’m saving 50W per R815 by moving from dual-ranked registered memory to single-ranked unbuffered sticks.