The highlights of the P4 (Intel® NetBurst™ Micro-Architecture and Intel® 850 Chipset) design
The Pentium 4 processor is Intel's highest performance desktop processor as measured by the
SPEC CPU* 2000 benchmark. At 1.5 GHz, it earned a SPECint*2000 score of 535 and a
SPECfp*2000 score of 558. It delivers immediate performance improvements across most
existing software applications, with gains varying by application category and by how often an
application executes the instructions and instruction sequences that run optimally on the new
micro-architecture. Among the improvements
found are:
- Hyper-Pipelined Technology: An ultra-deep 20-stage execution pipeline. Hyper-Pipelined
Technology enables the Pentium 4 processor to execute software instructions in a 20-stage
pipeline, as compared to the 10-stage pipeline of the Pentium III processor. Hyper-Pipelined
Technology supports a new range of clock speeds, beginning today with 1.5 and 1.4 GHz, with
plenty of headroom for the future. Deep pipelining of the execution unit makes it easier to turn
up the clock rate.
- Keeping the High-Frequency Execution Units Busy (vs. Sitting Idle): A processor with
high-frequency execution units (integer and floating point) must also ensure that those units are
continually supplied with instructions to execute; otherwise they sit idle. Given the high
frequency of the execution units in the NetBurst micro-architecture, and the Rapid Execution
Engine’s Arithmetic Logic Units running at two times the core frequency, Intel has implemented
a number of features to keep these units fed: a 400-MHz system bus, an Advanced Transfer
Cache, an Execution Trace Cache, an Advanced Dynamic Execution engine and a low-latency
Level 1 Data Cache. These features work together to quickly provide instructions and data to
the processor’s high-performance execution units, thus keeping them executing code instead of
idling at high frequency.
- Rapid Execution Engine: Through a combination of architectural, physical and circuit
design, the simple Arithmetic Logic Units within the processor run at two times the frequency of
the processor core. This allows the ALUs to execute certain instructions with a latency of half a
core clock, yielding higher execution throughput as well as reduced execution latency. Because
the integer ALUs run at twice the frequency of the rest of the processor, frequently used ALU
instructions are effectively executed at double the core clock, enhancing integer performance.
- Minimizing the Penalty Associated with Branch Mis-predicts: An enhanced branch
prediction unit to keep execution flowing. The 4K entry branch target array is eight times that of
the P6. Better branch prediction should keep the P4's deep pipeline executing the proper
instructions, avoiding branch mis-prediction penalties. Explanation of Branch Mis-predict
Penalty: As with the P6 generation, the NetBurst micro-architecture takes advantage of
out-of-order, speculative execution. This is where the processor routinely uses an internal
branch prediction algorithm to predict the result of branches in the program code and then
speculatively executes instructions down the predicted code branch. Although branch prediction
algorithms are highly accurate, they are not 100% accurate. If the processor mis-predicts a
branch, all the speculatively executed instructions must be flushed from the processor pipeline in
order to restart the instruction execution down the correct program branch. On more deeply
pipelined designs, more instructions must be flushed from the pipeline, resulting in a longer
recovery time from a branch mis-predict. The net result is that applications with many
difficult-to-predict branches will tend to have a lower average IPC (instructions per clock).
Minimization of the mis-predict penalty: To minimize the branch mis-prediction penalty and
maximize average IPC, the deeply pipelined NetBurst micro-architecture greatly reduces the
number of branch mis-predicts and provides a quick method of recovering from those that do
occur. To this end, the NetBurst micro-architecture implements an Advanced Dynamic
Execution engine and an Execution Trace Cache.
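As a concrete illustration of why branch behavior matters on a deep pipeline, here is a hypothetical C sketch (not from Intel's documentation; the function names are our own) of the same reduction written two ways. On random data the first version suffers frequent mis-predicts, each costing a pipeline flush on a 20-stage design; the second replaces the branch with a mask, giving the predictor nothing to guess:

```c
/* Branchy version: whether the branch is taken follows the data,
   so on random input the predictor mis-predicts often. */
int sum_positives_branchy(const int *v, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] > 0)          /* hard to predict when v[] is random */
            sum += v[i];
    }
    return sum;
}

/* Branch-free version: a mask replaces the conditional, so the
   pipeline never speculates on the data values. */
int sum_positives_branchless(const int *v, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int mask = -(v[i] > 0);   /* all-ones if positive, else zero */
        sum += v[i] & mask;
    }
    return sum;
}
```

Both functions compute the same result; the trade-off sketched here (spending extra ALU work to avoid a data-dependent branch) is exactly the kind that becomes more attractive as pipelines deepen.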
- Advanced Dynamic Execution: The Advanced Dynamic Execution engine is a very deep,
out-of-order speculative execution engine that keeps the execution units executing instructions. It
does so by providing a very large window of instructions from which the execution units can
choose. The large out-of-order instruction window allows the processor to avoid stalls that can
occur while instructions are waiting for dependencies to resolve. One of the more common
forms of stalls is waiting for data to be loaded from memory on a cache miss. This aspect is very
important in high frequency designs, as the latency to main memory increases relative to the core
frequency. The NetBurst micro-architecture can have up to 126 instructions in this window (vs.
the P6 micro-architecture’s much smaller window of 42 instructions). The Advanced Dynamic Execution engine also
delivers an enhanced branch prediction capability that allows the Pentium 4 processor to be
more accurate in predicting program branches. This has the net effect of reducing the number of
branch mis-predictions by about 33% over the P6 generation processor’s branch prediction
capability. It does this by implementing a 4K-entry branch target buffer that stores more detail on the
history of past branches, as well as by implementing a more advanced branch prediction
algorithm. This enhanced branch prediction capability is one of the key design elements that
reduce the overall sensitivity of the NetBurst micro-architecture to the branch mis-prediction
penalty.
- Execution Trace Cache: The Execution Trace Cache is an innovative way to implement a
Level 1 instruction cache. Rather than storing raw x86 instructions, it stores the decoded
micro-ops, thus removing the latency associated with the instruction decoder from
the main execution loops. In addition, the Execution Trace Cache stores these micro-ops in the
path of program execution flow, where the results of branches in the code are integrated into the
same cache line. This increases the instruction flow from the cache and makes better use of the
overall cache storage space (12K micro-ops) since the cache no longer stores instructions that
are branched over and never executed. The result is a means to deliver a high volume of
instructions to the processor’s execution units and a reduction in the overall time required to
recover from branches that have been mis-predicted. The L1 data cache is surprisingly small at
8KB, with Intel presumably having chosen to sacrifice cache hit rates for lower latencies. These
low-latency L1 caches are also very high bandwidth.
- Advanced Transfer Cache: The Level 2 Advanced Transfer Cache is 256KB in size and
provides a much higher-throughput data channel between the Level 2 cache and the processor
core. The Advanced Transfer Cache features a 256-bit (32-byte) interface that transfers data
on each core clock. As a result, a 1.4-GHz Pentium 4 processor can deliver a data transfer rate
of 44.8GB/s (32 bytes x 1 (data transfer per clock) x 1.4 GHz = 44.8GB/s). This compares to
a transfer rate of 16GB/s on the Pentium III processor at 1 GHz and contributes to the Pentium
4 processor’s ability to keep the high-frequency execution units executing instructions vs. sitting
idle.
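The bandwidth arithmetic above can be written out explicitly. This small C helper is our own illustration; it simply multiplies the three factors named in the text (interface width, transfers per clock, and core clock):

```c
/* Peak cache bandwidth = bytes per transfer
 *                      x transfers per clock
 *                      x clock rate (GHz), giving GB/s. */
double l2_bandwidth_gb_per_s(double bus_bytes,
                             double transfers_per_clock,
                             double core_ghz) {
    return bus_bytes * transfers_per_clock * core_ghz;
}

/* Pentium 4 at 1.4 GHz: l2_bandwidth_gb_per_s(32, 1, 1.4) -> 44.8 GB/s,
 * matching the figure in the text. */
```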
- Reducing the Number of Instructions Needed to Complete a Task or Program: Many
applications often perform repetitive operations on large sets of data. Further, the data sets
involved in these operations tend to be small values that can be represented with a small number
of bits. These two observations can be combined to improve application performance by both
compactly representing data sets and by implementing instructions that can operate on these
compact data sets. This type of operation is called Single Instruction Multiple Data (SIMD) and
can reduce the overall number of instructions that a program is required to execute. The
NetBurst micro-architecture implements 144 new SIMD instructions, called Streaming SIMD
Extensions 2 (SSE2). The SSE2 instruction set enhances the SIMD instructions previously
delivered with MMX technology and SSE technology. These new instructions support 128-bit
SIMD integer operations and 128-bit SIMD double-precision floating-point operations. By
doubling the amount of data on which a given instruction can operate, only half the number of
instructions in a code loop need to be executed.
- Streaming SIMD Extensions 2 (SSE2): The 144 new Streaming SIMD Extensions
instructions, called SSE2, enable integer and floating-point operations on 128-bit packed
data, including double-precision floating-point math. SSE2 also includes operations for cache
and memory management. With the introduction of SSE2, the NetBurst micro-architecture
extends the SIMD capabilities that MMX technology and SSE technology delivered, adding
128-bit SIMD integer arithmetic operations and 128-bit SIMD double-precision floating-point
operations. These new instructions reduce the overall number of instructions required to
execute a particular program task and as a result can contribute to an overall performance
increase. They accelerate a broad range of applications, including video, speech, image and
photo processing, encryption, financial, engineering and scientific applications.
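A minimal sketch of what SSE2's 128-bit double-precision operations look like from C, using the standard <emmintrin.h> intrinsics (`_mm_add_pd` maps to the SSE2 addpd instruction; the function name and loop structure are our own illustration). Each packed add processes two doubles at once, halving the instruction count of the equivalent scalar loop:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Adds two arrays of doubles, two elements per iteration via SSE2. */
void add_arrays_sse2(const double *a, const double *b,
                     double *out, int n) {
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);     /* load 128 bits (2 doubles) */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));  /* packed add */
    }
    for (; i < n; i++)                        /* scalar tail for odd n */
        out[i] = a[i] + b[i];
}
```

The unaligned loads (`_mm_loadu_pd`) keep the sketch simple; production code of the era would typically align the arrays and use the aligned forms.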
- 400-MHz System Bus: The industry's first 400-MHz system bus speeds the transfer of data
between the processor and main memory. The Intel 850 chipset's dual RDRAM* memory
banks complement the Pentium 4 processor's 400 MHz system bus, providing up to 3.2
gigabytes of data per second. Like the Athlon's EV6 bus, the P4 bus sends data more than once
per clock cycle. In this case, it does so four times. This new bus is deeply pipelined and capable
of split transactions, and it features a few other optimizations to make better use of its
bandwidth. Through a physical signaling scheme of quad pumping the data transfers over a
100-MHz clocked system bus and a buffering scheme allowing for sustained 400-MHz data
transfers, the Pentium 4 processor supports Intel’s highest performance desktop system bus
delivering 3.2GB of data per second in and out of the processor. This compares to 1.06GB/s
delivered on the Pentium III processor’s 133-MHz system bus. Coupled with the P4's
improved bus are two channels of Direct Rambus DRAM, also providing 3.2GB/second of
peak bandwidth.
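The bus figures above follow from the same kind of arithmetic, assuming (as the numbers imply, though the text does not state it) an 8-byte-wide data path. This C helper is our own illustration:

```c
/* Peak bus bandwidth = bytes per transfer
 *                    x transfers per bus clock (4 when quad-pumped)
 *                    x bus clock rate (GHz), giving GB/s. */
double bus_bandwidth_gb_per_s(double bytes_per_transfer,
                              int transfers_per_clock,
                              double clock_ghz) {
    return bytes_per_transfer * transfers_per_clock * clock_ghz;
}

/* Pentium 4, quad-pumped 100-MHz bus:
 *   bus_bandwidth_gb_per_s(8, 4, 0.100) -> 3.2 GB/s
 * Pentium III, 133-MHz bus, one transfer per clock:
 *   bus_bandwidth_gb_per_s(8, 1, 0.133) -> ~1.06 GB/s
 * matching the figures quoted in the text. */
```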