[Pc_Support] Re: Good read on the future of writing code --
programmer viewpoint (i.e., gross ignorance ;-)
Justin M. Keyes
justinkz at gmail.com
Wed Jun 28 11:03:22 EDT 2006
> From: Bryan J. Smith <b.j.smith at ieee.org>
> Date: Jan 23, 2005 3:13 PM
> Subject: [Pc_Support] Re: Good read on the future of writing code --
> programmer viewpoint (i.e., gross ignorance ;-)
> To: "This is the PC Support list." <pc_support at matrixlist.com>
>
>
> On Sun, 2005-01-23 at 03:33, paddy wrote:
> > Bryan;
> > Don't know if you have come across this article, but it seems that, once
> > again, times have changed.
> > http://www.gotw.ca/publications/concurrency-ddj.htm
>
> First off, this is an article written by a programmer for programmers.
> Programmers _think_ they understand CPUs, but they do _not_. Data
> organization != microprocessor design -- far from it.
>
> For example, "HyperThreading" is an Intel _marketing_ approach by which
> the OS uses the chip _inefficiently_ to do what the CPU should do
> _internally_. Intel uses this because they have yet to build an x86 CPU
> with register renaming, out-of-order execution and other goodies, unlike
> AMD. HyperThreading is _not_applicable_ to newer CPUs -- only old CPU
> designs like Intel's 12-year old 7-issue i686 (Pentium Pro).
>
> Modern CPUs use _internal_ optimizations that are far _more_ efficient
> than using the OS to "context switch" between 2 different threads like
> they were full CPUs with their own registers and units. That is very,
> very _inefficient_, which is why HyperThreading can often result in
> _slower_ performance than without it on Intel x86 CPUs (because of the
> added CPU overhead). This ignorance en-masse just drives people like me
> up-the-wall (and don't get me started on "Centrino" technology ;-).
>
> Now, back to the article, here is what is _really_ happening in the CPU
> world:
>
> 1. More run-time optimization in-chip
> 2. Virtualization
> 3. Removal of the traditional instruction set (whoo hoo!)
> 4. Removal of the clock (finally!)
>
> #1 is obvious, Intel is trying to "modernize" their x86 CPUs for the first
> time in 12 years. The i686 (Pentium Pro) IA-32 was designed to scale
> from 200MHz to 1GHz as Intel thought the IA-64 Itanium would be out by
> 2000 and replacing the i686. Unfortunately, IA-64's pure
> EPIC+Predication (pure programmer concepts) _failed_ to work in silicon
> as designed (again, too many programmers not listening to engineers ;-),
> and Intel was stuck "retrofitting" the i686 IA-32 into P4 IA-32 and
> "Yamhill Prescott" IA-32e. Now Intel has temporarily gone back to the
> original i686 which has been extended to 1.5GHz and even 2.1+GHz with
> other techniques than inefficiently extending the pipes (which is what
> the P4/Prescott do, but can't pass 3.8GHz).
>
> Intel is basically working on "Yamhill2," which will finally bring register
> renaming, out-of-order execution (OO) and a better branch predictor unit
> (BPU) to the scene. If Intel succeeds in this by 2006, AMD will have
> its first _real_ competition in a _solid_ ALU+FPU, and no more of the
> _joke_ of SIMD instructions that are 100% _marketing_ (largely to
> programmers).
>
> AMD isn't scot-free either though. Their aging Athlon core released in
> 1999 was only designed to scale from 500MHz to 3GHz, and 3GHz is turning
> more into 2.6GHz right now. Although not as old as Intel i686, it's
> still 6 years old, and a new architecture is warranted. But it's
> obvious AMD knew this, because they designed Opteron with limitations
> like only 40-bit physical addressing (reusing EV6) and 1MB of L2 (even
> though EV6 is capable of 8MB) and other details, because they knew it
> would not be around for too much longer. The sheer design of AMD
> x86-64, with its i486-compatible 48-bit addressing and the PAE52
> "programmer" model, just suggests virtualization, which brings us to the
> next point.
>
> #2 is virtualization. x86 is inefficient and a PITA. Plus there is the
> real issue that, to maintain i486 TLB compatibility, addressing is only
> 48-bit or 256TiB. 256TiB is closer than you think. In fact, Opteron only
> supports 40-bit or 1TiB (EV6), and 1TiB is 64 x 4-way, 16GB Opteron 800
> systems in a HyperTransport cluster. And that's before we even look at
> memory mapped I/O considerations, which the Opteron does on-chip too.
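
The address-space figures quoted above can be sanity-checked with a few
lines of Python. This is only an illustrative sketch: the helper
addressable_tib and the constant names are assumptions, not anything from
the original message, and the quoted bit widths (48-bit, 40-bit) and the
"64 systems of 16GB each" example are taken at face value.

```python
# Sanity check of the address-space arithmetic quoted above.
# All names here are illustrative assumptions.

GIB = 2**30  # bytes in one GiB
TIB = 2**40  # bytes in one TiB

def addressable_tib(address_bits):
    """Return the byte-addressable space for an address width, in TiB."""
    return 2**address_bits / TIB

virtual_tib = addressable_tib(48)   # the 48-bit limit
physical_tib = addressable_tib(40)  # the 40-bit limit

# 64 systems holding 16 GiB each, as in the quoted example.
cluster_tib = 64 * 16 * GIB / TIB

print(virtual_tib, physical_tib, cluster_tib)
```

Under those assumptions the 48-bit space works out to 256 TiB, and the
64-system cluster fills the 40-bit (1 TiB) space exactly.
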
>
> Virtualization is the _evolution_ of multi-core (and well _beyond_
> hyperthreading for that matter, which is an Intel hack for an early '90s
> CPU design ;-). It is the concept that there is a main "control unit"
> that offers two or more "virtualized" instances of x86-64. How the OS
> uses these instances is up to the OS, but it's very likely that
> there will be a virtualizing OS to match, that handles the main "control
> unit," and then virtualizes multiple "real" OSes underneath it over
> various, dynamically assigned CPU "instances." In fact, VMWare already
> offers its own, standalone OS for doing this on large-scale (4-32 CPU),
> shared memory systems. So it will be the first candidate.
>
> For compatibility, I'm sure when running a single, "real" OS,
> virtualized CPUs will look like multiple CPUs, just like traditional
> multi-cores do. So you'll just need an OS with support for how many
> cores it can provide. Maybe the OS will offer advanced support for the
> main "control unit," so it can offer additional features. But for all
> intents and purposes, it is multi-core because one OS controls _all_
> CPUs/cores.
>
> #3 and #4 bring us back to the original problem of the microprocessor --
> designed by programmers for programmers.
>
> Before the integrated circuit, programmers used transistors to build
> boolean gates. Thinking arithmetic, they built circuits that took
> operands and put them through an operator -- thus, the Arithmetic Logic
> Unit (ALU) was born. The operator+operand was known as an
> "instruction," and it was coded into some variable length stream of one
> or more bytes. With the integrated circuit, we needed a "control" --
> the part that replaces the mathematician and says "go to the next step"
> -- and so we got the "clock."
>
> The ALU did one thing at a time. It completed all traditional steps --
> fetch, decode, execute -- all done step-by-step on the "clock" -- before
> the next "instruction." Some processors also offered an equivalent
> Floating Point Unit (FPU) for decimal arithmetic, and even trigonometry
> later on.
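
The strictly sequential fetch, decode, execute cycle described above can
be sketched as a toy machine. Everything in it is hypothetical and chosen
only for illustration: the three-operation instruction format, the single
accumulator, and the "one tick per step" timing are assumptions, not a
model of any real processor.

```python
# Toy sequential machine: each instruction is fully fetched, decoded and
# executed before the next one starts. Instruction format and names are
# hypothetical.

OPERATIONS = {
    "LOAD": lambda acc, operand: operand,
    "ADD": lambda acc, operand: acc + operand,
    "SUB": lambda acc, operand: acc - operand,
}

def run(program):
    acc = 0    # single accumulator register
    pc = 0     # program counter
    ticks = 0  # one clock tick per step, by assumption
    while pc < len(program):
        instruction = program[pc]            # fetch
        ticks += 1
        operator, operand = instruction      # decode
        ticks += 1
        acc = OPERATIONS[operator](acc, operand)  # execute
        ticks += 1
        pc += 1
    return acc, ticks

result, ticks = run([("LOAD", 5), ("ADD", 7), ("SUB", 2)])
print(result, ticks)
```

With three steps per instruction and no overlap, the tick count grows as
three times the number of instructions, which is the inefficiency that
pipelining later attacks.
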
>
> Once we moved from transistors to integrated circuits to the eventual
> microprocessor, this was carried over. The "instruction" or "machine
> code" was a 1st generation (1g) programming language, designed by
> programmers for programmers. Rather obviously, even before the
> integrated circuit, people came up with a 2nd generation (2g)
> programming language by creating English mnemonics that could be
> "assembled" into 1g machine code "instructions." Assembler has the same
> rules of the ALU, which meant each assembler language for each CPU was
> different.
>
> We call this complex instruction set computation (CISC) -- very complex
> instructions designed to do many details in hardware that programmers
> needed. In the early '80s, the Intel 8088 and 8086 took over, and its
> "x86" instruction set of variable length "instruction words" of 8-bit to
> even 160-bit, became widespread.
>
> By the early '80s, many physicists and engineers realized we made the
> biggest mistake in the universe. The programmer instruction set is
> _not_ ideal for computing. It was clear the concepts of pipelining
> (multiple stages being executed simultaneously) and superscalar
> (multiple pipes where stages could be pipelined) were never designed for
> sequential "instructions." The instructions were way too complex. Some
> instructions took a few cycles while others took dozens. Worse yet,
> they were variable length -- which is the killer from the standpoint of
> multiple MEAGs/"one-hots" (which selects the unit/datapath), let alone
> their optimization.
>
> Luckily for the physicists and engineers, use of 3rd generation (3g)
> languages was becoming widespread. The C compiler had become the
> absolute of all other higher languages. So as long as programmers used
> the C compiler, or another language that output its instructions into C
> code, the instruction set underneath could change. This allowed the
> creation of reduced instruction set computation
> (RISC), by which instructions were simple, closer together in number of
> cycles, and had _fixed_ operator and fixed options/operand that _only_
> worked on registers (_never_ memory other than LOAD/STORE).
>
> The programmers chastised RISC, but programmers no longer designed CPUs,
> engineers did. Engineers who had been educated in semiconductor layout
> and analog concepts like electromagnetic fields (EMF), which were a real
> issue. By the late '80s, CPUs could no longer simply be designed with
> digital logic -- the analog effects of sub-500nm had to be considered
> too. And because of the C compiler, the engineers could now present a
> "3g" interface to programmers, because there was no way the programmers
> could understand how to optimize code for a superscalar RISC processor.
>
> The first two RISC designs were Berkeley RISC and Stanford MIPS. The
> former was first, but the latter was first-to-market. Ironically, the
> company named after the Stanford University Network (SUN) which sold,
> like everyone else, various MIPS1000, 2000 and even some 3000, systems,
> chose to develop Berkeley RISC into today's SPARC. Sun, who had
> quickly become a major provider of growing Internet systems by the late
> '80s, basically demolished the standardization efforts around MIPS.
> From there, IBM and Motorola collaborated on a PC-oriented instance of
> IBM's Power, the PowerPC. SGI stuck with MIPS. And Digital and Intel
> started to collaborate on a new processor design.
>
> At the time, Intel had already committed itself to a 5-issue,
> superscalar x86 implementation known as Pentium (not only for the 5th
> generation x86, but also for the 5-pipelined design -- 2 ALU, 1 FPU).
> Digital was building the world's most anal RISC design, a 7-issue, true
> 64-bit chip that only had 32-bit and 64-bit instructions, not even an 8
> or 16-bit LOAD/STORE (although these were added in the 21164 due to
> programmer demand ;-). Intel looked at Digital and thought they could
> do better with their EPIC+Predication approach (below), and broke the
> agreement. But not before taking many of the concepts and building a
> new 7-issue "sister" to Pentium in the Pentium Pro. This, of course,
> resulted in a lawsuit a few years later.
>
> Digital, well ahead of its time, recognized that instruction set
> emulation could be achieved with good performance and perfect
> compatibility. The only constraint was that the emulation had to be on
> the _same_ operating environment, with the same operating environment
> structures and objects. That's why Alpha, unlike almost any other
> modern processor, was designed with *0* backward compatibility. Digital
> found that on its new 64-bit RISC AXP instruction set, they could not
> only fully emulate 32-bit CISC VAX instructions, but they could "binary
> recompile" them into new 64-bit RISC AXP instructions that ran
> _native_. Again, the only constraint was that this had to be on the
> same OS, so the same data organization could be guaranteed. Most famous
> of this was FX!32, which allowed NT/x86 programs and even low-level
> services to run on NT/AXP. Linux and other processor/OS "binary
> translation" followed as well.
>
> [ NOTE: Transmeta's VLIW architecture exploits this concept. Build a
> very generic, anal RISC architecture (with huge 128-bit RISC
> instructions; most RISC is 32-40-bit instructions), and then embed
> the translation software in firmware, and voila ... you have a processor
> that can emulate _any_ other processor, as long as the software is for
> the _same_ OS as the firmware (typically the firmware is good for
> Windows and, of course, Linux). There is also a native Linux/TS port
> (no translation). ]
>
> Because Alpha was so anal, the 21064 ramped up speeds to 300MHz
> overnight, the 21164 to 500MHz and well beyond after that.
> Unfortunately, just as the 21264 was in development, and Samsung
> produced a 1.2GHz 21164 _years_ before AMD or Intel did the same,
> Digital changed focus in 1996. CEO Palmer started selling off
> everything, first the fabs of Digital Semiconductor and then the Alpha
> itself, to Intel. The Alpha remained largely at 350nm, and Samsung was
> too busy with memory parts to dedicate its fabs to anything else.
>
> The reason Intel broke with Digital is because they were already talking
> to HP. Like IBM brought Power to Motorola, HP brought PA-RISC to
> Intel. Intel liked HP's idea because it was x86 compatible in hardware,
> whereas Digital was talking software-based "binary translation" that was
> unproven at the time. HP-Intel had a new idea for RISC. Although RISC
> was far better at keeping superscalar pipes full than CISC, it was still
> only about 60-70% compared to 30-40%. That means in RISC, 30-40% of the
> stages of various pipes in the CPU still go unused. There was also the
> issue of a branch mispredict, by which the CPU completely stalls as it
> has to flush all pipes to ensure no incorrect computation remains. In a
> 7-issue, 10-stage superscalar architecture like the i686 -- that's up to
> 70 stages (virtually hundreds of equivalent cycles), and pipelines were
> only getting longer.
>
> [ NOTE: Branch prediction is a necessity in superscalar architectures.
> Because different pipes are executing different pieces of code
> (especially so in OO/register renaming designs), and because pipelines
> are fetching and executing later instructions in the pipes themselves,
> the CPU is often executing on data beyond a branch. As such, the branch
> prediction unit (BPU) ensures the correct branch will be taken. If not,
> everything's gotta go because the CPU was executing code with values
> that have most likely changed. The difference between the tested 94%
> accuracy of the i686 BPU and the 97-98% of the Athlon BPU is orders of
> magnitude. Not good when you have something like the "Prescott" with
> 7+2 (2 SSE) pipes of 40 (yes _forty_) stages -- bam! There is a 6%
> chance that over a thousand cycles will be _wasted_ every time the P4
> hits a branch.
>
> Compare this to the 3% chance that 9 pipes of 18 stages will be wasted in the
> Athlon, let alone the Athlon has "stateful" mechanisms that can save
> some stages. The Athlon has _internal_ register renaming and OO logic
> that can keep track of multiple, independent data flows so it can
> identify what code is and is not affected by a branch mispredict.
> Intel's HyperThreading does this at the _OS_ to do the same for 2
> threads explicitly, because it lacks the features internally. So a
> branch mispredict in a HyperThreaded CPU is about half of the cycles,
> whatever cycles were working on the thread affected. The "cost" of
> Intel's approach is that there is a _massive_ overhead cost in OS
> context switching, and the software must be well threaded to take
> advantage of it. Lastly, HyperThreading is _never_ as efficient as 2
> real CPUs, let alone _never_ as efficient as having a _real_ internal
> design with register renaming and OO. ]
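
The comparison in the note above can be put into rough numbers. The
sketch below is a deliberately naive model, not a description of how
either chip actually behaves: it assumes a mispredict throws away every
stage of every pipe, and it takes the quoted figures (6% miss rate with
9 pipes of 40 stages, 3% with 9 pipes of 18 stages) at face value. The
function name expected_flush_cost is hypothetical.

```python
# Naive expected-cost model for a branch mispredict, using only the
# figures quoted in the note above. All names are illustrative.

def expected_flush_cost(mispredict_percent, pipes, stages):
    """Expected pipeline slots discarded per branch, assuming a
    mispredict flushes every stage of every pipe."""
    slots = pipes * stages
    return mispredict_percent * slots / 100

# "Prescott" as quoted: 7+2 pipes, 40 stages, 94% predictor accuracy.
p4_cost = expected_flush_cost(6, 9, 40)

# Athlon as quoted: 9 pipes, 18 stages, roughly 97% predictor accuracy.
athlon_cost = expected_flush_cost(3, 9, 18)

print(p4_cost, athlon_cost, p4_cost / athlon_cost)
```

Even in this crude model the gap comes from two factors multiplying
together, the higher miss rate and the deeper pipeline, which is the
point the note is making.
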
>
> The HP-Intel solution was known as Explicitly Parallel Instruction
> Computation (EPIC) with Branch Predication. EPIC relies on 100%
> compiler-time optimizations to schedule 3 instructions to execute
> simultaneously. And instead of Branch Prediction, they used Branch
> Predication: execute _both_ paths and discard the unneeded result. The software
> simulations worked well for both, as the programmer ideals behind them
> said they would. When Intel first announced the first Itanium in
> "Merced," and the details of how IA-64 works, Digital Semiconductor
> instantly lambasted it.
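
The "execute both paths and discard the result" idea can be illustrated
in software. This is only an analogy under stated assumptions: the two
function names are hypothetical, absolute value is an arbitrary example,
and real predication happens per instruction in hardware rather than in
Python expressions.

```python
# Hypothetical illustration: absolute value computed two ways.

def abs_with_branch(x):
    # Conventional control flow: only one path is executed.
    if x < 0:
        return -x
    return x

def abs_with_predication(x):
    # Both paths are computed unconditionally ...
    negated = -x
    unchanged = x
    # ... and a predicate selects which result is kept; the other
    # result is simply discarded.
    predicate = int(x < 0)
    return predicate * negated + (1 - predicate) * unchanged

print(abs_with_branch(-4), abs_with_predication(-4))
```

The predicated version never has to guess which way the test goes, but it
always pays for the path whose result is thrown away.
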
>
> They said that Intel could not solely rely on compile-time
> optimizations, and that branch predication would waste cycles, more than
> the cost of a branch mispredict. They predicted that Intel maybe would
> reach 80% efficiency with EPIC, but nothing as good as RISC with
> run-time optimizations. And the idea to make IA-64 partially PA-RISC
> and partially x86 wasn't going to work, either you make it fully
> compatible (not worth it in Digital's eyes), or you use "binary
> translation" which Digital was firmly committed to.
>
> Shortly afterwards, most of Digital had been broken up. Some from
> Digital Semiconductor started Alpha Processor, Inc., which eventually
> became API Networks (now an east coast R&D arm of AMD). Others went to
> Samsung, AMD, etc... AMD was already working with Digital on licensing
> the 40-bit/1TiB EV6 interconnect and building the first non-32-bit/4GiB
> Intel GTL/GTL+ (i586/i686) PC bus signaling platform. And they had much
> grander schemes for 64-bit, but x86 compatible, and virtualization. [
> And this was just 1997 ]
>
> Intel was dedicated to EPIC. 5 years after Digital released its first
> 500MHz 21164, Intel released the Merced. Its FPU couldn't even muster
> the performance of what the 500MHz 21164 was capable of, let alone what
> the 1GHz 21164 could, and the forthcoming 667MHz 21264 would do. EPIC
> failed to keep the pipes full. Branch predication ended up costing more
> cycles than it saved. And Intel quickly realized that Digital was
> right, compile-time optimizations were based on optimizing _programmer_
> concepts -- the instruction set -- not physics/engineering concepts, the
> P/N substrates and even higher digital logic into usable units.
>
> The 90nm Itanium2 was released just a few years ago at speeds up to
> 833MHz. Its competitor, released earlier, was the yester-year fabbed
> 250nm Alpha 21264 at much slower speeds of 667MHz and 733MHz. The
> Itanium2, with retrofitted OO and a new branch predictor unit was even
> slower versus the 21264 at floating point, than the original Itanium was
> against the 21164. But the Itanium2 was still adopted for many 32+ way
> systems, because it was cheaper than RISC processors like the Alpha
> produced in far less quantity. _All_ of these systems use proprietary
> NUMA interconnects, not the standard "shared bus" of the Itanium, much
> like proprietary NUMA Xeon systems in the past, instead of GTL+.
>
> Last year Intel released a new "software enhancement" for Itanium2. It
> allows x86 instructions to run much faster in software than hardware.
> What is it? Yeah, that's right, Digital FX!32 adopted Itanium so it can
> run NT/x86 on NT/IA-64, Linux/x86 on Linux/IA-64, etc... faster. Funny,
> no?
>
> Which brings us to "Yamhill." Most in the IT industry just thought
> Yamhill is an AMD x86-64 instruction set clone (with the processors known
> as AMD64), what Intel calls IA-32e (with the "Prescott" processors known
> as EM64T). One major problem there. AMD64 is more than an instruction
> set, it's a NUMA/multi-bus, with EV6 (40-bit/1TiB) addressing. EM64T
> still relies on GTL+ (or AGTL+, which is Rambus signaling for Socket-423
> and LGA-775). GTL+ is _strictly_ 32-bit/4GiB, with a paging option for
> PAE36 (32-bit/64GiB) because of how the i486TLB uses a 32-bit offset
> from a 16-bit segment (whereby the last 4-bits of the segment overhang
> the 32-bits of offset). But it's still, physically 32-bit (even if
> programmers think it is 36-bit at the register).
>
> _Never_ deploy EM64T where more than 4GB is used, especially not for I/O
> intensive applications. EM64T lacks an I/O MMU, so it cannot safely
> guarantee memory mapped I/O above 4GiB. AMD64, heck even AMD32, use a
> 40-bit I/O MMU (this is the on-chip AGPgart in AMD32, which must be
> on-chip in point-to-point "switched" EV6, not on-chipset in "shared bus"
> GTL+) to guarantee DMA/DiME (DiME is for AGP/PCIe) cache-memory
> consistency between CPU and I/O (especially AGP/PCIe which acts
> like a CPU, directly using system memory simultaneously with
> processors).
>
> I predict that like Pentium that saw the Pentium Pro developed almost
> simultaneously, "Yamhill2" (maybe call it "Pentium 64") is also in
> development. The first "Prescott Yamhill" was like the P4, a quick, 18
> month refit. The second "Pentium 64" will be the new virtualized chip,
> one that shares the same 53-bit interconnect and LGA socket with
> Itanium3. You'll have a variant of the IA-64 "control unit," with
> x86-64 instances. And these new x86-64 instances _will_ have modern
> register renaming, OO and a new BPU. If you're running a legacy OS, the
> instances will just look like multicores.
>
> And it will compete better with AMD's new 64-bit processor series. The
> playing field will be even once again, although Intel still leads AMD in
> fab technology some 9-12 months, so Intel _may_ be on-top. Unless AMD
> completes their design first, and something tells me they will (because
> x86-64's PAE52 (52-bit registers for "programmers") was designed for
> virtualization -- something Intel didn't realize until _after_ AMD made
> it public).
>
> - But after x86, instruction sets are _dead_
>
> Of course, now we're still talking CISC/RISC operator+operand
> instruction sets. Virtualization will slowly but surely remove them,
> because the "main" chip won't have a traditional instruction set, it
> will just have its x86 "instances" that do. If you want to see the
> difference between "programmer" thinking and "engineer" thinking,
> compare assembler/machine code to VHDL/Verilog. In a nutshell, the new
> "instruction sets" that will _not_ be doable below 3g (the C compiler)
> will look more like VHDL/Verilog logic, and not traditional
> operand+opcode approaches.
>
> [ I'm not saying VHDL/Verilog will be how they are programmed. But the
> concepts are much closer to the circuits than traditional
> operator+operand machine code. ]
>
> The programmer has left the microprocessor building ... yes!
>
> In reality, the _first_ processor that will do away with instruction
> sets is the new Sony "Cell" processor. Because it needs a special new C
> compiler for this new instruction set, and an OS that is built on it,
> guess what this solution is? GNU/Linux of course. Since it's an
> embedded system, they only need a minimal OS, and whatever
> libraries/support they build around it. Like Sony, Nintendo is already
> using GNU/Linux as their development platform for their current
> platforms. So it will be interesting if Nintendo goes to GNU/Linux in
> the platform itself (let alone the architecture choice) like Sony is
> with the Cell-based PS3.
>
> It is uncertain what IBM-Microsoft are up to with Xbox2, but Xbox2 will
> _not_ be a PC (and probably not compatible either). Last I heard it's a
> new variant of Power, possibly a non-traditional instruction set too.
> Unlike GNU/Linux, this is going to be a real PITA for Microsoft, because
> right now the Xbox is DOS+Win, and Sega already tried a console with the
> CE kernel which largely sucked. Maybe a new Embedded NT port?
>
> These new "Cell" approaches (I need to read up on the patents because
> I'm not familiar with how they work), probably feed right into the
> virtualization approaches too. So it could be very possible for Sony to
> offer some PS1/PS2 compatibility. Who knows?
>
> - The clock has got to go ...
>
> Intel has already adopted asynchronous techniques inside of some logic
> so the gates are not on the clock -- just the gates before/after. This
> is how Intel scaled i686 up to 1.4GHz in the P3 originally. I would not
> be surprised if that is how they are doing it now with the Pentium-M --
> which is _still_ the traditional i686. The Pentium 4 has basically
> _failed_ to reach 4GHz (much less the originally promised 10GHz), and
> Intel is going with this Pentium-M refit of the aged i686 core. So
> Intel's familiarity with clockless gates may give it an advantage over
> AMD in the near future.
>
> The clock causes a massive EMF to be generated when all the gates switch.
> Theoretically, clockless circuits should use 1/4th the power. In
> reality, because a clock is still involved outside the chip for
> interconnect, 50-60% is typical because some gates inherently switch on
> similar cycles. At least this is what we found at Theseus Logic with
> our NCL08 chip. The lack of EMF generation with clockless is why
> Theseus Logic has some work in the smartcard industry.
>
> At the same time, clockless chips are far more tolerant of EMI than
> clocked. The gates are not relying on a signal to switch, so they
> cannot be "set off" by a stray EMF or other signal. As such, Theseus
> Logic's first partner was a pacemaker vendor, and they are currently
> partnered with major aerospace firms for defense applications. Their
> 3NCL (Null Convention Logic) dual-rail (0/NUL, 1/NUL) approach is
> considered one of the easiest to implement in the industry, whereas
> approaches like packetized boolean (still single-rail 0/1, saving space
> over dual-rail) like that in Furber's Amulet (clockless ARM) still have
> timing closure issues. Steve Furber is also on Theseus Logic's Board
> (has been since 200).
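
The dual-rail idea mentioned above (each bit is either a 0, a 1, or NULL)
can be sketched in a few lines. This is a minimal illustration under
stated assumptions, not Theseus Logic's implementation: the assignment of
wire pairs to DATA0, DATA1 and NULL is one common convention, and all of
the function names are hypothetical.

```python
# Minimal dual-rail sketch: every bit travels on two wires, so "no data
# yet" (NULL) is distinguishable from a 0 or a 1 without a clock edge.
# Rail assignments and names are illustrative assumptions.

NULL = (0, 0)   # neither rail asserted: no data present
DATA0 = (1, 0)  # rail 0 asserted: logical 0
DATA1 = (0, 1)  # rail 1 asserted: logical 1

def encode(bits):
    """Encode a list of 0/1 values as dual-rail pairs."""
    return [DATA1 if bit else DATA0 for bit in bits]

def is_complete(word):
    """True once every bit position carries DATA."""
    return all(pair in (DATA0, DATA1) for pair in word)

def is_null(word):
    """True once every bit position has returned to NULL."""
    return all(pair == NULL for pair in word)

def decode(word):
    """Decode a complete dual-rail word back to 0/1 values."""
    if not is_complete(word):
        raise ValueError("word is not complete yet")
    return [1 if pair == DATA1 else 0 for pair in word]

word = encode([1, 0, 1])
partial = [DATA1, NULL, DATA1]  # middle bit has not arrived yet
print(is_complete(word), is_complete(partial), decode(word))
```

Because completion is detected from the data itself, a receiver can wait
for is_complete instead of a clock edge, at the cost of two wires per bit.
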
>
> [ Personal Note/Jab: I can understand all British locale vocals
> _except_ Manchester (west/northwest), although Furber has been around
> enough that his has been tamed a bit. ;-]
>
> The Semiconductor Industry Association (SIA) predicted that "timing
> closure" would require clockless adoption by 2006, and processors will
> have to be _completely_ clockless by 2012, to maintain Moore's Law.
> Clockless adoption is happening whereby partial and even entire units
> are now having their gates switch independently of a clock, and only the
> portion or entire unit itself is on the clock. I.e., the clock is still
> used for knowing when to push through the next stage. Intel seems to be
> doing this now, at least in part, with the newer Pentium-M designs.
> NOTE: I have not confirmed this, but it is a logical assumption based
> on how the i686 was redesigned beyond its original 1GHz barrier in the
> P3 by Intel's own commentary that async _was_ used.
Bryan, now that I'm taking a computer organization and architecture
class, I searched my mail for "opcodes" and "branch prediction" and
found the above email. I'm finally able to follow about 70% of it.
It's fascinating!
When you say "entire units are now having their gates switch
independently of a clock", what do you mean by "unit"? The entire
microprocessor? In today's desktop processor (e.g., Intel Core Duo,
AMD Sempron, etc.), what percentage of the chip would you say is
clockless? What percentage of the motherboard is clockless? Has SIA's
prediction about "timing closure" been met in 2006 yet (it seems Moore's
law has been stagnant for years now...)?
Thanks for your insight...
I just read an old Technology Review article on clockless chips that I
had in a stack of papers, and guess what... it mentions Theseus Logic!
Ha! I found a link to the article:
http://www1.cs.columbia.edu/async/misc/technologyreview_oct_01_2001.html
--
Justin M. Keyes