The highlights of the P4 (Intel® NetBurst™ Micro-Architecture and Intel® 850 Chipset) design


The Pentium 4 processor is Intel's highest-performance desktop processor as measured by the SPEC CPU* 2000 benchmark. At 1.5 GHz, it earned a SPECint*2000 score of 535 and a SPECfp*2000 score of 558. It shows immediate performance improvements across most existing software applications, with the gains varying by application category and by how often an application executes the instruction sequences that the new micro-architecture runs optimally. Among the improvements are:

  1. Hyper-Pipelined Technology: An ultra-deep 20-stage execution pipeline, twice the depth of the Pentium III processor's 10-stage pipeline. Because shorter pipeline stages make it easier to raise the clock rate, Hyper-Pipelined Technology supports a new range of clock speeds, beginning today with 1.5 and 1.4 GHz, with plenty of headroom for the future.

  2. Keeping the High-Frequency Execution Units Busy (vs. Sitting Idle): High frequency is wasted unless the integer and floating-point execution units are continually supplied with instructions to execute. With the Rapid Execution Engine's Arithmetic Logic Units running at two times the core frequency, Intel has implemented a number of features in the NetBurst micro-architecture to keep a continuous instruction stream flowing: a 400-MHz system bus, an Advanced Transfer Cache, an Execution Trace Cache, an Advanced Dynamic Execution engine and a low-latency Level 1 data cache. These features work together to quickly supply instructions and data to the processor's high-performance execution units, keeping them executing code instead of idling at high frequency.

  3. Rapid Execution Engine: Through a combination of architectural, physical and circuit design, the simple Arithmetic Logic Units within the processor run at two times the frequency of the processor core. This allows the ALUs to execute frequently used integer instructions with a latency of half a core clock, increasing execution throughput as well as reducing execution latency.

  4. Minimizing the Penalty Associated with Branch Mis-predicts: An enhanced branch prediction unit keeps execution flowing; its 4K-entry branch target array is eight times the size of the P6's. Better branch prediction keeps the P4's deep pipeline executing the proper instructions and avoids branch mis-prediction penalties. Explanation of the branch mis-predict penalty: As with the P6 generation, the NetBurst micro-architecture takes advantage of out-of-order, speculative execution: the processor routinely uses an internal branch prediction algorithm to predict the outcome of branches in the program code and then speculatively executes instructions down the predicted path. Branch prediction algorithms are highly accurate, but not 100% accurate. When the processor mis-predicts a branch, all the speculatively executed instructions must be flushed from the pipeline before execution can restart down the correct path. The more deeply pipelined the design, the more instructions must be flushed, and the longer the recovery from a mis-predict. The net result is that applications with many difficult-to-predict branches tend to have a lower average IPC. Minimizing the mis-predict penalty: To maximize average IPC, the deeply pipelined NetBurst micro-architecture both greatly reduces the number of branch mis-predicts and recovers quickly from those that do occur, through two features: an Advanced Dynamic Execution engine and an Execution Trace Cache.

  5. Advanced Dynamic Execution: The Advanced Dynamic Execution engine is a very deep, out-of-order speculative execution engine that keeps the execution units busy by providing a very large window of instructions from which they can choose. The large out-of-order instruction window lets the processor avoid stalls that occur while instructions wait for dependencies to resolve; one of the most common is waiting for data to be loaded from memory on a cache miss. This is especially important in high-frequency designs, where the latency to main memory grows relative to the core clock. The NetBurst micro-architecture can have up to 126 instructions in this window (vs. the P6 micro-architecture's much smaller window of 42 instructions). The engine also delivers an enhanced branch prediction capability that makes the Pentium 4 processor more accurate in predicting program branches, reducing branch mis-predictions by about 33% compared with the P6 generation. It does this with a 4K-entry branch target buffer that stores more detail on the history of past branches, together with a more advanced branch prediction algorithm. This enhanced branch prediction is one of the key design elements that reduce the NetBurst micro-architecture's overall sensitivity to the branch mis-prediction penalty.

  6. Execution Trace Cache: The Execution Trace Cache is an innovative implementation of the Level 1 instruction cache: instead of raw x86 instructions, it stores already-decoded micro-ops, removing the latency of the instruction decoder from the main execution loops. In addition, the Execution Trace Cache stores these micro-ops in the order of program execution flow, so the results of branches in the code are integrated into the same cache line. This increases instruction flow from the cache and makes better use of the overall cache storage space (12K micro-ops), since the cache no longer stores instructions that are branched over and never executed. The result is a high volume of instructions delivered to the processor's execution units and a reduction in the time required to recover from mis-predicted branches. The L1 data cache is surprisingly small at 8KB, with Intel presumably having chosen to sacrifice cache hit rate for lower latency. These low-latency L1 caches are also very high bandwidth.

  7. Advanced Transfer Cache: The Level 2 Advanced Transfer Cache is 256KB in size and provides a much higher-throughput data channel between the Level 2 cache and the processor core. It uses a 256-bit (32-byte) interface that transfers data on every core clock. As a result, a 1.4-GHz Pentium 4 processor can deliver a data transfer rate of 44.8GB/s (32 bytes x 1 transfer per clock x 1.4 GHz = 44.8GB/s). This compares to a transfer rate of 16GB/s on the Pentium III processor at 1 GHz, and contributes to the Pentium 4 processor's ability to keep its high-frequency execution units executing instructions rather than sitting idle.

  8. Reducing the Number of Instructions Needed to Complete a Task or Program: Many applications perform repetitive operations on large sets of data, and the values involved are often small enough to be represented with a small number of bits. These two observations can be combined to improve application performance by packing data sets compactly and by implementing instructions that operate on these compact data sets. This type of operation is called Single Instruction Multiple Data (SIMD) and can reduce the overall number of instructions a program must execute. The NetBurst micro-architecture implements 144 new SIMD instructions, called Streaming SIMD Extensions 2 (SSE2), which enhance the SIMD instructions previously delivered with MMX technology and SSE technology. The new instructions support 128-bit SIMD integer operations and 128-bit SIMD double-precision floating-point operations. By doubling the amount of data on which a given instruction can operate, they let a code loop execute only half as many instructions.

  9. Streaming SIMD Extensions 2 (SSE2): The 144 new Streaming SIMD Extensions instructions, called SSE2, enable integer and floating-point operations on 128-bit packed data, including double-precision floating-point math, and add operations for cache and memory management. With the introduction of SSE2, the NetBurst micro-architecture extends the SIMD capabilities that MMX technology and SSE technology delivered, adding 128-bit SIMD integer arithmetic and 128-bit SIMD double-precision floating-point operations. These instructions reduce the overall number of instructions required to execute a particular program task and as a result can contribute to an overall performance increase. They accelerate a broad range of applications, including video, speech, image and photo processing, encryption, and financial, engineering and scientific applications.

  10. 400-MHz System Bus: The industry's first 400-MHz system bus speeds the transfer of data between the processor and main memory. Like the Athlon's EV6 bus, the P4 bus transfers data more than once per clock cycle; in this case, four times, quad-pumping the data transfers over a 100-MHz clocked system bus, with a buffering scheme that allows sustained 400-MHz data transfers. The result is Intel's highest-performance desktop system bus, delivering 3.2GB of data per second in and out of the processor, compared with 1.06GB/s on the Pentium III processor's 133-MHz system bus. The new bus is also deeply pipelined, capable of split transactions, and includes a few other optimizations to make better use of its bandwidth. Complementing the bus, the Intel 850 chipset's dual Direct Rambus DRAM (RDRAM*) channels provide a matching 3.2GB/s of peak memory bandwidth.