Quick, write it down: just another CPU design idea
2021-05-05 23:57 cpuidea [permalink]
Ever since learning about the Mill architecture, watching an excellent series of MIT 6.004 lectures by Chris Terman (I can't find them on Youtube any more! But there are newer versions of 6.004 online which are probably great as well.), and reading up on what's new in x64, RISC-V and ARM, my mind wanders from time to time to whether you could go crazy and design yet another CPU that does novel things with the umptillions of available transistors, without the downsides of the currently popular CPUs.
So here's an idea I just need to write out of my system, and let you have a look to see if there's anything there at all. Modern CPUs use virtual registers to support hyperthreading and speculative execution, pushing work ahead in the hope that the right branch was taken, so that pipelining works better for the workload at hand. The Mill architecture would handle that a little differently and fill the pipeline with more concurrent threads, interleaving those. (At least that's how I understood what would be going on.)
What if you design a CPU that takes in a larger set of instruction streams, using a bank of virtual registers that get allotted to these streams, and also has hot memory just like the closest cache, but used for the stack? Ideally you could expect programs to keep the current stack frame within one or a few kilobytes, and by having a larger number of streams you could avoid having to switch this stack along with everything else on context changes (by the OS). Switching the stack and overflowing it to and from memory is something that will have to happen, so you'll have decent support for that, but if you're after speed you'll try to avoid it as much as possible.
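To make that a bit more concrete, here's a minimal C sketch (mine, nothing from a real design) of what the per-stream state could hold: a bank of virtual registers plus a small hot stack kept on-core. The register count and the stack size are just guesses for illustration, roughly matching the numbers further down.

    /* Hypothetical per-stream state; all names and sizes are guesses. */
    #include <stdint.h>

    #define VREGS_PER_STREAM 32      /* assumed number of virtual registers per stream */
    #define HOT_STACK_BYTES  4096    /* a few KiB of on-core stack per stream */

    struct stream_context {
        uint64_t vregs[VREGS_PER_STREAM];    /* virtual registers allotted to this stream */
        uint8_t  hot_stack[HOT_STACK_BYTES]; /* current stack frame(s) kept on-core */
        uint32_t stack_index;                /* top of stack inside hot_stack */
        uint64_t pc;                         /* next instruction address for this stream */
        int      active;                     /* whether the OS has mapped a thread here */
    };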
Let's have some numbers just to get a clearer mental picture. Say a core supports 256 instruction streams. If the work for them is pipelined across the rest of the core, addressing them would take 8 bits, so you would see a band of 8 lines pipelined to everything everywhere. It's not even required to have 256 instruction decoders: sub-groups of streams could share common instruction decoders. Depending on the instruction set design itself, this could turn out to be the bottleneck, but let's find that out with further design and research.
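Just to play with those numbers, a toy C snippet: 256 streams means an 8-bit stream ID travelling with the work, and if sub-groups of, say, 16 streams share a decoder you'd need 16 decoders. The group size and the static grouping are purely assumptions on my part.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_STREAMS         256  /* streams per core */
    #define STREAMS_PER_DECODER  16  /* assumed sub-group size sharing one decoder */

    int main(void) {
        /* 256 streams fit in 8 bits: an 8-line stream ID band through the core. */
        int id_bits = 0;
        for (int n = NUM_STREAMS - 1; n > 0; n >>= 1) id_bits++;

        printf("stream ID width: %d bits\n", id_bits);                        /* 8 */
        printf("decoders needed: %d\n", NUM_STREAMS / STREAMS_PER_DECODER);   /* 16 */

        /* Which decoder serves stream 200 under simple static grouping. */
        uint8_t stream = 200;
        printf("stream %d -> decoder %d\n", stream, stream / STREAMS_PER_DECODER);
        return 0;
    }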
Let's say each stream gets 4KiB of hot stack memory, 1MiB in total, which is not unrealistic for a modern core nowadays, if I understood that correctly. This special memory could have extra lines so it automatically overflows into system memory if the local stack index rolls over, which could perhaps also help with loading and flushing stack data on context changes.
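Here's a hand-wavy software analogue of that overflow behaviour, in C: pushes land in the hot stack, and when the local index would roll over, the oldest chunk gets flushed out to backing system memory. The spill granularity and the sliding-window bookkeeping are simplifications I made up; real hardware would presumably do something smarter with those extra lines.

    #include <stdint.h>
    #include <string.h>

    #define HOT_STACK_BYTES 4096   /* 4 KiB of hot stack per stream */
    #define SPILL_CHUNK      512   /* assumed granularity of spills to system memory */

    struct hot_stack {
        uint8_t  hot[HOT_STACK_BYTES]; /* on-core stack memory */
        uint32_t top;                  /* bytes currently used in hot[] */
        uint8_t *backing;              /* system memory this stack spills into (set by caller) */
        uint32_t spilled;              /* bytes already flushed to backing */
    };

    /* Push a value (assumed much smaller than SPILL_CHUNK); if the hot
     * stack is full, flush its oldest chunk to system memory first. */
    static void push(struct hot_stack *s, const void *val, uint32_t size) {
        if (s->top + size > HOT_STACK_BYTES) {
            memcpy(s->backing + s->spilled, s->hot, SPILL_CHUNK);
            s->spilled += SPILL_CHUNK;
            memmove(s->hot, s->hot + SPILL_CHUNK, s->top - SPILL_CHUNK);
            s->top -= SPILL_CHUNK;
        }
        memcpy(s->hot + s->top, val, size);
        s->top += size;
    }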
With instruction fetching, stack handling and these streams doing their thing with allotted virtual registers, there's a lot covered, but of course the sweet magic happens in the ALUs and related units. So down the pipeline, according to what instruction decoding prescribes, the streams queue up to get one of the available ALUs to handle an operation.
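And a very rough model of that queueing, again in C: decoded operations carry their 8-bit stream ID into a shared request queue, and each cycle the available ALUs pull requests off the front, oldest first. The ALU count, the queue depth and the first-come-first-served policy are all assumptions for the sketch.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_ALUS    8   /* assumed number of ALUs in the shared pool */
    #define QUEUE_SIZE 64   /* assumed depth of the shared request queue */

    struct alu_request {
        uint8_t  stream_id; /* which of the 256 streams issued this op */
        uint8_t  op;        /* operation selector from the decoder */
        uint64_t a, b;      /* source operands read from virtual registers */
    };

    static struct alu_request queue[QUEUE_SIZE];
    static int q_head, q_tail, q_count;

    static int enqueue(struct alu_request r) {
        if (q_count == QUEUE_SIZE) return 0;   /* back-pressure: the stream must stall */
        queue[q_tail] = r;
        q_tail = (q_tail + 1) % QUEUE_SIZE;
        q_count++;
        return 1;
    }

    /* One simulated cycle: up to NUM_ALUS requests are issued, oldest first. */
    static void issue_cycle(void) {
        for (int alu = 0; alu < NUM_ALUS && q_count > 0; alu++) {
            struct alu_request r = queue[q_head];
            q_head = (q_head + 1) % QUEUE_SIZE;
            q_count--;
            printf("ALU %d executes op %d for stream %d\n", alu, r.op, r.stream_id);
        }
    }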
This would be the point at which you start on a real design in a simulator, but that's where my knowledge is stronger than my experience. I've just decided to start writing things down (right here, right now), let it simmer a bit, and maybe pick up a first attempt sometime later. If you see what a former attempt like plasm ("play assembler") looks like, don't get your hopes up too much. For now it's good that I wrote the core idea down here. If I ever get down to it again and get something working, you'll read it here. Sometime later.