Wednesday, August 23, 2006

The UNTOLD Reason why AMD will have to go with Native Quadcore

While Intel is releasing its quad-core soon (expected in Q4 2006), AMD will release its native quad-core two or three quarters later than Intel. AMD (and its fans) have also tried to play down Intel's non-native version, claiming that the native approach is better.

What people fail to realize is that if AMD used Intel's approach before coming out with the native version, the end result would be an internally NUMA quad-core, which is bad for mobile, for desktop, and even as a NUMA node in a server MP system. So it is not really a matter of native being better than non-native for AMD; it is just that the non-native approach is NOT good for AMD.

Why bad? Currently (and for the foreseeable future), few if any apps are written with NUMA optimization. And most desktop/laptop apps don't require that level of memory bandwidth (NUMA has better bandwidth but with a catch: it needs software optimization that doesn't work for all workloads). NUMA is a sensible thing in server MP, not in desktop or laptop. Having 2 memory links also raises the system cost, making it unsuitable for cost-sensitive markets.
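To make the "catch" concrete, here is a rough sketch of how the average memory latency depends on how much of the traffic lands on the local node. The function name and the latency figures are placeholders chosen for illustration, not measurements of any AMD or Intel part.

```python
def expected_latency_ns(local_fraction, local_ns, remote_ns):
    """Average memory latency when only part of the accesses hit the local node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Placeholder latencies (assumptions, not measured values).
LOCAL_NS = 60.0
REMOTE_NS = 100.0

# Software that ignores placeholder NUMA placement on a two-node package:
# roughly half of its accesses end up on the other node.
unaware = expected_latency_ns(0.5, LOCAL_NS, REMOTE_NS)

# Software (or an OS) that manages to keep three quarters of the accesses local.
tuned = expected_latency_ns(0.75, LOCAL_NS, REMOTE_NS)

print(f"NUMA-unaware: {unaware:.1f} ns, tuned: {tuned:.1f} ns")
```

The point of the sketch is only that the benefit depends on the local fraction, which is exactly the part that needs the software (or OS) work.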

Mobile specifically is driven mainly by form factor, power and wireless. 1P is definitely the solution for it. Having NUMA within the 1P means it needs a minimum of 2 DIMMs, so it might not be a good candidate for certain very small form factor mobile devices. The unnecessary memory bandwidth in most mobile applications also drives the power up, without guaranteeing a significant improvement. (I'm not sure whether certain apps would even get slower.)

For server MP in particular, if this internally NUMA chip is used as a NODE, there will be multiple node distances in the whole MP design, which again makes the software optimization harder.
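A small sketch of the node-distance point, under an assumed wiring: four sockets fully connected to each other, and, in the two-die case, a second die in each package that reaches the outside world only through the first die. The topology and the node names are hypothetical; a real MCM part could be wired differently.

```python
from collections import deque
from itertools import combinations


def hop_distances(links):
    """Return the sorted distinct shortest-path hop counts between all node pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    distinct = set()
    for start in graph:
        # Breadth-first search from each node gives its hop count to every other node.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        distinct.update(d for n, d in dist.items() if n != start)
    return sorted(distinct)


# Case 1: four sockets, each a single NUMA node, fully connected (assumed).
native = list(combinations(["S0", "S1", "S2", "S3"], 2))

# Case 2: each socket holds two dies; only die "a" has external links (assumed),
# and die "b" hangs off die "a" inside the package.
mcm = [(f"S{i}a", f"S{j}a") for i, j in combinations(range(4), 2)]
mcm += [(f"S{i}a", f"S{i}b") for i in range(4)]

print(hop_distances(native))  # [1]
print(hop_distances(mcm))     # [1, 2, 3]
```

With this wiring the single-node sockets see one distance everywhere, while the two-die packages see three different distances, and that spread is what the OS or application would have to account for.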

10 comments:

Anonymous said...

Good point. However I do not agree with your take on Internal NUMA and External NUMA in MP scenario.

The only problem with NUMA is the non-uniform memory latency. However, AMD's two-hop latency will still be very close to the latency on Woodcrest, and hence they can definitely afford to do it this way. I think, the real problem is that, they cannot afford to sell quad-core parts before they hit 65 nm. They are capacity constrained, and quad-cores will strain their margins. 4x4 is a better way to go, they sell more processors.

Anonymous said...

I read on INQ that K8L might be delayed till early 08. Any other confirmations on that news? AMD's announcement is pretty vague. It's impossible to figure out if the quad-core next year is K8L or just a shrink of K8.

If K8L comes in 08, they run the risk of it being obsolete on arrival. Penryn will be on mature 45 nm by then. And the next Intel micro-architecture will be just around the corner.

pointer said...

However, AMD's two-hop latency will still be very close to the latency on Woodcrest

I do not know the 1-hop latency number, but I'd think it is about the same as Intel's (I simply imagine everything on Intel is about 1 hop anyway through the FSB, but with better caching). Thus, I would highly doubt that AMD's 2 hops will be close to Intel's latency. But anyway, maybe this is no real concern, as in the server space the vendors are willing to invest in optimizing the software.

4x4 is not aimed at the server MP space; any delay to the server quad-core, even while claiming that Intel's approach is non-native, will still make them lose some market share (one side with an offering, the other without any counter-offering). Besides, I have serious doubts about 4x4; maybe I should write another blog article to explain that (by pulling together all my blog comments :))

"Mad Mod" Mike said...

Pointer, you really have no idea what you're talking about, do you?

Internal NUMA Quad-Core? WTH does that mean? NUMA does need optimizations to be very effective, but you fail to realize that most OS's have a basic knowledge of NUMA and take care of the important tasks -- giving a decent performance gain in some apps.

NUMA isn't ever a downfall, nor is it required. There is only 1 Memory Controller per CPU, not per CORE, so there is no such thing as "Internal NUMA" -- nor would a non-native design hurt them. HyperTransport would be plenty of speed for coherency.

pointer said...

Internal NUMA Quad-Core? WTH does that mean?

internal NUMA means there is one extra NUMA node within the package.

NUMA isn't ever a downfall, nor is it required. There is only 1 Memory Controller per CPU, not per CORE, so there is no such thing as "Internal NUMA"

I put up this article to explain why AMD cannot put 2 dies in a package the way Intel did. If they do so, it will end up as I described in the article. I'm not trashing NUMA here, just giving some thoughts on why AMD MUST go native. And as you said, one package should have one IMC; if they put 2 dies together, there will be 2 IMC links (resulting in what I call internal NUMA :))

-- nor would a non-native design hurt them. HyperTransport would be plenty of speed for coherency.

As I explained in the article, the 2-IMC design rules out its use in the mobile space. There is no software or performance concern here yet; it simply fails the form factor requirement.

Go and refer to AMD's foil on their everything-one-node-away MP approach. If there is internal NUMA, there will be more than one NODE distance. While server vendors might want to invest money in optimizing the software for it, it is bad for the desktop.

"Mad Mod" Mike said...

"internal NUMA means there is one extra NUMA node within the package."

A NUMA node is basically a memory controller, of which there is 1 on an AMD64 CPU. You are thinking they would have 2 Memory Controllers if it was Dual Die, and that is where you are drawing your conclusion from (I think). There would not be a problem, as they could easily link the 2 memory controllers together via HT and solve the issue. They would likely eliminate 1 Memory Controller by disabling it, to reduce power consumption and eliminate the need for multiple banks of memory.

pointer said...

There would not be a problem, as they could easily link the 2 memory controllers together via HT and solve the issue. They would likely eliminate 1 Memory Controller by disabling it, to reduce power consumption and eliminate the need for multiple banks of memory.

While this is a possible solution, it makes the quad-core run much slower and thus defeats its purpose. The memory latency for the die that has its IMC disabled will be much larger, and AMD does not have a bigger cache to compensate for that. Again, it is tough work for software to optimize for it. Another concern is that AMD's dual-core IMC is designed for a dual-core workload. Having the other die route 100% of its traffic through that IMC would likely overload it. AMD really has to go native for its quad-core.

ashenman said...

I think it's a combination of the increased latency and their manufacturing capacity. While they could simply ask Chartered to ramp a bit faster, they'd pay for it in profit per chip, as they'd probably have to pay Chartered much more for the processors they get from them. AMD is simply working on getting its name out there right now, which is what system integrators are doing a pretty good job of at the moment. This takes a lot of processors and does a number on your ASP.

Since AMD has to strain its manufacturing more to get enough chips, it can't afford to halve any portion of its production in order to slop two chips together for performance that would suffer from said latency. While 4x4 may seem to contradict this, it's only a small amount of production that is basically just an Opteron. I don't even think they technically have to mess with the memory controller. (Though it would be interesting to hear what they would need to change to remove the need for registered memory.) So production doesn't really change. I bet we'll see higher-clocked Opteron releases that correlate with the new FX releases. If so, then this would validate my theory.

Scientia from AMDZone said...

If I'm understanding your reasoning, you are suggesting that AMD would need two memory controllers for MCM and that this would make it too expensive for mobile. The problem with this idea is that Intel doesn't have MCM for mobile either. Yonah and Merom are both native dual core.

MMM is correct that applications are not written for NUMA optimization. NUMA optimization is performed by the operating system.

Your statement about one hop is incorrect as it takes two hops for Opteron 8-way systems. This will be reduced when AMD adds a fourth HT link with DC 2.0 in 2008. This could also be improved in 2007 by making use of the split mode for HT 3.0.

Finally, the extra latency from having an additional node is less than the latency that Intel gets on servers by having all FBDIMM slots filled.

So, now your entire argument has been essentially disproven and we are back to the point you started with. AMD is doing native quad core because it performs better, not because it has to. MCM would be about the same amount of effort for AMD as native. However, the MCM approach would be slower.

pointer said...

If I'm understanding your reasoning, you are suggesting that AMD would need two memory controllers for MCM and that this would make it too expensive for mobile.

Nope, that's not what I said. It is just that each of its dual-cores has 1 IMC; as an MCM, it would have 2 IMCs. Mad Mod Mike suggested that 1 IMC could be disabled (which is sensible, since it keeps the package pin-compatible), but that would hurt performance.

Finally, the extra latency from having an additional node is less than the latency that Intel gets on servers by having all FBDIMM slots filled.

So, now your entire argument has been essentially disproven and we are back to the point you started with. AMD is doing native quad core because it performs better, not because it has to. MCM would be about the same amount of effort for AMD as native. However, the MCM approach would be slower.


What are you talking about? I said the MCM approach for AMD is slower, or unsuitable, to the point that AMD has to go native. Your comment doesn't do anything to show that an AMD MCM approach would be marketable.

For the sake of clarity, I'll just requote what I said:
1) With 2 IMC links: not a good form factor for laptops, a bad optimization problem for laptop + desktop, and one additional NUMA node that the MP server has to take care of. (One thing I was too lazy to post here is the lack of pin compatibility with the previous chip.)
2) With 1 IMC link (one disabled): it makes the quad-core run much slower and thus defeats its purpose. The memory latency for the die that has its IMC disabled will be much larger, and AMD does not have a bigger cache to compensate for that. Again, it is tough work for software to optimize for it. Another concern is that AMD's dual-core IMC is designed for a dual-core workload. Having the other die route 100% of its traffic through that IMC would likely overload it. AMD really has to go native for its quad-core.
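The two options can be put side by side in a back-of-the-envelope sketch. It assumes every core generates the same memory traffic and that a die whose IMC is off must cross the die-to-die link for every access; the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PackageOption:
    name: str
    active_imcs: int        # memory controllers left enabled in the package
    dies: int = 2           # two dual-core dies in one package (MCM)
    cores_per_die: int = 2

    def cores_per_imc(self) -> float:
        """How many cores each enabled controller has to serve."""
        return self.dies * self.cores_per_die / self.active_imcs

    def load_vs_dual_core(self) -> float:
        """Controller load relative to one dual-core die using its own IMC."""
        return self.cores_per_imc() / self.cores_per_die

    def dies_paying_extra_hop(self) -> int:
        """Dies that must go through the other die for every memory access."""
        return self.dies - self.active_imcs


options = [
    PackageOption("1) two IMC links", active_imcs=2),
    PackageOption("2) one IMC disabled", active_imcs=1),
]

for opt in options:
    print(
        f"{opt.name}: IMC load x{opt.load_vs_dual_core():.1f}, "
        f"dies paying an extra hop: {opt.dies_paying_extra_hop()}"
    )
```

Under these assumptions option 1 keeps each controller at the load it was designed for but leaves two memory links on the package, while option 2 doubles the load on the remaining controller and puts one die a hop away from all of its memory.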