Multi-core has made it to cell phones. Doug follows up on some recent stories about low-power cell phone clusters and high-power multi-core memory performance.
A few weeks back, I discussed the possibility of cell phone clusters. While cell phone processors continue to get faster, other issues make them a poor platform for traditional HPC. As I mentioned, there are some "crowd sourcing" possibilities that may have interesting applications for clusters of smart phones. Recently, Samsung announced the first dual-core smart phone using the ARM Cortex-A9 processor. The Cortex-A9 used by Samsung runs at 1 GHz. For an old-timer like me, that is amazing. My first cluster used dual 450 MHz processors.
The A9 is interesting, but the quad-core A15 is going to get ARM out of phones and into tablets and servers. Features like double-precision floating-point, vector extensions, support for up to 1 TB of main memory, and ECC cache are not phone features (yet). I have written previously about low-power server processors (Intel Atom) and HPC. There are issues of scalability, which come down to the interconnect, but in general the "low processors" (low meaning low power, low heat, low cost) may have a play in HPC. Along with the A15, there are other low processors, including the AMD Bobcat, the Intel Atom, and the Qualcomm Snapdragon. Keep your eye on this space; as the market pushes for more hand-held computing, there will be a lot of action with these "low processors."
Moving on to the "high processors," I want to follow up on last week's column covering my Gulftown multi-core tests. I noticed that I mislabeled two tests in Table One: I swapped the results for the "is" and "lu" tests. The corrected rows are given in Table One below. The mix-up has no effect on the averages at the bottom of the previous table. By the way, I use the NAS class "A" size for this benchmark. I will probably start using the "B" size soon, as some of these tests finish rather quickly.
Test | 2 copies | 4 copies | 8 copies | 12 copies | 16 copies
lu   | 2.0      | 3.9      | 6.5      | 6.1       | 6.7
is   | 2.0      | 4.0      | 7.8      | 11.2      | 14.8
Table One: Corrected Effective Cores for a 12-way Intel Xeon (Gulftown) SMP server running the NAS suite
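As a quick reminder of the methodology from the previous column, each number above is an "effective cores" result obtained by running several simultaneous copies of a NAS benchmark. Below is a minimal Python sketch of one way such a number could be computed; the binary path, timing approach, and exact formula are my own assumptions for illustration, not the actual script behind these tables.

```python
# Hedged sketch: estimate "effective cores" by running N copies of a
# benchmark at once and comparing against the single-copy runtime.
# The benchmark path (CMD) and the formula are assumptions for
# illustration, not the exact method used for Tables One and Two.
import subprocess
import time

CMD = ["./bin/is.A.x"]   # hypothetical path to a NAS class "A" binary

def run_copies(n):
    """Launch n copies simultaneously and return the wall time until the slowest finishes."""
    start = time.time()
    procs = [subprocess.Popen(CMD, stdout=subprocess.DEVNULL) for _ in range(n)]
    for p in procs:
        p.wait()
    return time.time() - start

if __name__ == "__main__":
    t1 = run_copies(1)                    # baseline: one copy on an idle node
    for n in (2, 4, 8, 12, 16):
        tn = run_copies(n)
        # If n copies finish in the same time as one copy, we effectively
        # have n cores' worth of throughput; scale down as tn grows.
        effective = n * t1 / tn
        print(f"{n:2d} copies: effective cores = {effective:.1f}")
```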
After I ran my set of tests, I recalled that the Gulftown has Simultaneous Multithreading, or SMT. Once enabled in the BIOS, SMT doubles the number of cores seen by the OS. While many people may recall Hyper-Threading (HT), SMT is supposed to be better at helping hide memory latency. That is, while a core is waiting on a memory access, it can in theory be running another "thread." This technique may be very helpful with I/O issues as well; however, most HPC applications hit memory hard, and there may not be much benefit from SMT. If you recall, the conventional wisdom was to turn off HT on Intel processors when doing HPC.
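If you want to check whether SMT is actually enabled on a node without rebooting into the BIOS, comparing logical and physical core counts is enough. The short sketch below parses /proc/cpuinfo on Linux; it is a convenience check I am assuming here, not something used in the benchmark runs.

```python
# Hedged sketch: report logical vs. physical core counts on Linux by
# parsing /proc/cpuinfo. If logical > physical, SMT (Hyper-Threading)
# is enabled. Linux-only; not part of the NAS runs described above.
def core_counts(path="/proc/cpuinfo"):
    logical = 0
    physical = set()            # unique (physical id, core id) pairs
    phys_id = core_id = None
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if key == "processor":
                logical += 1
            elif key == "physical id":
                phys_id = value
            elif key == "core id":
                core_id = value
                physical.add((phys_id, core_id))
    return logical, len(physical)

if __name__ == "__main__":
    logical, physical = core_counts()
    print(f"logical CPUs:   {logical}")
    print(f"physical cores: {physical}")
    print("SMT enabled" if logical > physical else "SMT disabled (or unsupported)")
```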
To complete my tests, I decided to turn on SMT and re-run the 12-way and 16-way effective core tests. I only looked at the 12- and 16-way cases because I don't see SMT having any effect when the number of processes is less than the number of real cores. The results are in Table Two below. The 16-way results should be the most telling because the real cores are over-subscribed by four processes. With the exception of the ep benchmark, there does not seem to be any advantage to using SMT. Indeed, some benchmarks saw a decrease in effective cores. ep is more processor-bound and thus shows a nice performance boost. As expected, there was no improvement when running the 12-copy test with SMT.
Test | 12 copies | 12 copies (SMT) | 16 copies | 16 copies (SMT)
cg   | 6.6       | 6.6             | 7.7       | 7.8
bt   | 4.8       | 4.8             | 4.9       | 5.4
ep   | 11.8      | 11.7            | 12.7      | 14.0
ft   | 8.9       | 8.9             | 11.0      | 10.4
lu   | 6.1       | 6.0             | 6.7       | 6.7
is   | 11.2      | 11.0            | 14.8      | 12.6
sp   | 5.4       | 5.5             | 5.7       | 6.4
mg   | 6.6       | 6.5             | 9.1       | 7.9
Ave  | 7.7       | 7.6             | 9.1       | 8.9
Table Two: Effective Cores for a 12-way Intel Xeon (Gulftown) SMP server running the NAS suite with SMT enabled
In general, I don't think SMT will hurt anything as long as you don't oversubscribe the actual number of cores. It may allow daemons and other background processes to work better on compute nodes, but I don't see it making a huge difference (an assumption that should be tested).
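One way to get that "don't oversubscribe" behavior in practice is to pin compute processes to distinct physical cores and leave the SMT siblings free for daemons. The sketch below uses Linux CPU affinity from Python; the assumption that logical CPUs 0 through 11 map to the twelve distinct physical cores is system-dependent and should be verified with lscpu on your own node.

```python
# Hedged sketch: pin the current process to one CPU per physical core,
# leaving the SMT sibling CPUs free for daemons and other background work.
# ASSUMPTION: logical CPUs 0..(physical_cores-1) map to distinct physical
# cores (a common Linux enumeration, but verify with `lscpu -e` first).
import os

def pin_to_physical_cores(physical_cores=12):
    allowed = set(range(physical_cores))   # e.g. CPUs 0-11 on the Gulftown box
    os.sched_setaffinity(0, allowed)       # 0 = the calling process
    print("pinned to CPUs:", sorted(os.sched_getaffinity(0)))

if __name__ == "__main__":
    pin_to_physical_cores()
    # ... launch the benchmark from here; child processes inherit the affinity mask
```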
Moving over to the GP-GPU world, the NVIDIA GPU conference is next week. I was going to attend, but a scheduling conflict came up. Look for some good stuff to come out of this event. Since I won't be on the west coast next week, I will probably attend the one-day HPC Financial Markets event in New York City. This show used to be called "High Performance on Wall Street," and it has a small but free exhibit. Finally, I am amazed that my Twitter following has continued to grow even though I really don't "tweet" very much. Aside from mentioning my latest articles, I'll try to add some interesting tidbits now and then. It will probably be more like "then" than "now."