We will start with a conceptual understanding of PCI Express. This will let us appreciate the importance of PCI Express. This will be followed by a brief study of the PCI Express protocol. Then we will look at the enhancements and improvements of the protocol in the newer 3.0 specs.
1 Basic PC system architecture
We will start by looking at the basic layout of a PC system. Logically, an average PC system is laid out roughly as shown in the figure.
The core logic chipset acts as a switch or router, and routes I/O traffic among the different devices that make up the system.
In reality, the core logic chipset is split into two parts: the northbridge and the southbridge (or I/O bridge). This split is there for a couple of reasons, the most important of which is the fact that there are three types of devices that naturally work very closely together, and so they need to have faster access to each other: the CPU, the main memory, and the video card. In a modern system, the video card's GPU is functionally a second (or third) CPU, so it needs to share privileged access to main memory with the CPU(s). As a result, these three devices are all clustered together off of the northbridge.
The northbridge is tied to a secondary bridge, the southbridge, which routes traffic from the different I/O devices on the system: the hard drives, USB ports, Ethernet ports, etc. The traffic from these devices is routed through the southbridge to the northbridge and then on to the CPU and/or memory.
As is evident from the diagram above, the PCI bus is attached to the southbridge. This bus is usually the oldest, slowest bus in a modern system, and is the one most in need of an upgrade.
The main thing that we should take away from the previous diagram is that the modern PC is a motley collection of specialized buses of different protocols and bandwidth capabilities. This mix of specialized buses designed to attach different types of hardware directly to the southbridge is something of a continuously evolving hack that has been gradually and collectively engineered by the PC industry as it tries to get around the limitations of the aging PCI bus. Because the PCI bus can't really cut it for things like Serial ATA, Firewire, etc., the trend has been to attach interfaces for both internal and external I/O directly to the southbridge. So today's southbridge is sort of the Swiss Army Knife of I/O switches, and thanks to Moore's Curves it has been able to keep adding functionality in the form of new interfaces that keep bandwidth-hungry devices from starving on the PCI bus.
In an ideal world, there would be one primary type of bus and one bus protocol that connects all of these different I/O devices, including the video card/GPU, to the CPU and main memory. Of course, this "one bus to rule them all" ideal is never, ever going to happen in the real world. It won't happen with PCI Express, and it won't happen with Infiniband (although it technically could happen with Infiniband if we threw away all of today's PC hardware and started over from scratch with a round of natively Infiniband-compliant devices).
Still, even though the utopian ideal of one bus and one bus protocol for every device will never be achieved, there has to be a way to bring some order to the chaos. Luckily for us, that way has finally arrived in the form of PCI Express (a.k.a. PCIe).
2 A primer on PCI
Before we go into detail on PCIe, it helps to understand how PCI works and what its limitations are.
The PCI bus debuted over a decade ago at 33MHz, with a 32-bit bus and a peak theoretical bandwidth of 132MB/s. This was pretty good for the time, but as the rest of the system got more bandwidth hungry, both the bus speed and bus width were cranked up in an effort to keep pace. Later flavors of PCI included a 64-bit, 33MHz bus combination with a peak bandwidth of 264MB/s and a more recent 64-bit, 66MHz combination with a bandwidth of 528MB/s.
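The peak figures for a parallel bus follow from multiplying the bus width in bytes by the clock rate. A minimal sketch of that arithmetic, assuming one full-width transfer per clock cycle and the nominal clock rates quoted above (the function name is made up for illustration):

```python
def peak_bandwidth_mb_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak theoretical bandwidth in MB/s for a parallel bus that moves
    one bus-width word per clock cycle (an idealized assumption)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


# The three PCI flavors mentioned in the text.
for width, clock in [(32, 33), (64, 33), (64, 66)]:
    rate = peak_bandwidth_mb_s(width, clock)
    print(f"{width}-bit @ {clock}MHz: {rate:.0f} MB/s")
```

These are theoretical ceilings; arbitration and other protocol overhead on a shared bus keep sustained throughput below them.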
PCI uses a shared bus topology to allow for communication among the different devices on the bus; the different PCI devices (i.e., a network card, a sound card, a RAID card, etc.) are all attached to the same bus, which they use to communicate with the CPU. Take a look at the following diagram to get a feel for what a shared bus looks like.
Because all of the devices attached to the bus must share it among themselves, there has to be some kind of bus arbitration scheme in place for deciding who gets access to the bus and when, especially in situations where multiple devices need to use the bus at the same time. Once a device has control of the bus, it becomes the bus master, which means that it can use the PCI bus to talk to the CPU or memory via the chipset's southbridge.
The shared bus topology's main advantages are that it's simple, cheap, and easy to implement, or at least that's the case as long as you're not trying to do anything too fancy with it. Once you start demanding more performance and functionality from a shared bus, then you run into its limitations. Let's take a look at some of those limitations, in order to motivate our discussion of PCI Express's improvements.
This scheme works fine when there are only a few devices attached to the bus, listening to it for addresses and data. But the nature of a bus is that any device that's attached to it and is "listening" to it injects a certain amount of noise onto the bus. Thus the more devices that listen to the bus (and thereby place an electrical load on the bus), the more noise there is on the bus and the harder it becomes to get a clean signal through.
2.1 Summary of PCI's shortcomings
To summarize, PCI as it exists today has some serious shortcomings that prevent it from providing the bandwidth and features needed by current and future generations of I/O and storage devices. Specifically, its highly parallel shared-bus architecture holds it back by limiting its bus speed and scalability, and its simple, load-store, flat memory-based communications model is less robust and extensible than a routed, packet-based model.
3 PCI-X: wider and faster, but still outdated
The PCI-X spec was an attempt to update PCI as painlessly as possible and allow it to hobble along for a few more years. This being the case, the spec doesn't really fix any of the inherent problems outlined above. In fact, it actually makes some of the problems worse.
The PCI-X spec essentially doubled the bus width from 32 bits to 64 bits, thereby increasing PCI's parallel data transmission abilities and enlarging its address space. The spec also ups PCI's basic clock rate to 66MHz with a 133MHz variety on the high end, providing yet another boost to PCI's bandwidth and bringing it up to 1GB/s (at 133MHz).
The latest version of the PCI-X spec (PCI-X 266) also double-pumps the bus, so that data is transmitted on the rising and falling edges of the clock. While this improves PCI-X's peak theoretical bandwidth, its real-world sustained bandwidth gains are more modest.
While both of these moves significantly increased PCI's bandwidth and its usefulness, they also made it more expensive to implement. The faster a bus runs, the more sensitive it becomes to noise; manufacturing standards for high-speed buses are exceptionally strict for this very reason, since shoddy materials and/or wide margins of error translate directly into noise at higher clock speeds. This means that the higher-speed PCI-X bus is more expensive to make.
The higher clock speed isn't the only thing that increases PCI-X's noise problems and manufacturing costs. The other factor is the increased bus width. Because the bus is wider and consists of more wires, there's more noise in the form of crosstalk. Furthermore, all of those new wires are connected at their endpoints to multiple PCI devices, which means an even larger load on the bus and thus more noise injected into the bus by attached devices. And then there's the fact that the PCI devices themselves need 32 extra pins, which increases the manufacturing cost of each individual device and of the connectors on the motherboard.
All of these factors, when taken together with the increased clock rate, combine to make PCI-X a more expensive proposition than PCI, which keeps it out of mainstream PCs. It should also be noted that most of the problems with increasing bus parallelism and double-pumping the bus also plague recent forms of DDR, and especially the DDR-II spec.
And after all of that pain, you still have to deal with PCI's shared-bus topology and all of its attendant ills. Fortunately, there's a better way.
4 PCI Express: the next generation
PCI Express (PCIe) is the newest name for the technology formerly known as 3GIO. Though the PCIe specification was finalized in 2002, PCIe-based devices have just now started to debut on the market.
PCIe's most drastic and obvious improvement over PCI is its point-to-point bus topology. Take a look at the following diagram, and compare it to the layout of the PCI bus.
In a point-to-point bus topology, a shared switch replaces the shared bus as the single shared resource by means of which all of the devices communicate. Unlike in a shared bus topology, where the devices must collectively arbitrate among themselves for use of the bus, each device in the system has direct and exclusive access to the switch. In other words, each device sits on its own dedicated bus, which in PCIe lingo is called a link.
Like a router in a network or a telephone switchbox, the switch routes bus traffic and establishes point-to-point connections between any two communicating devices on a system.
4.1 Enabling Quality of Service
The overall effect of the switched fabric topology is that it allows the "smarts" needed to manage and route traffic to be centralized in one single chip: the switch. With a shared bus, the devices on the bus must use an arbitration scheme to decide among themselves how to distribute a shared resource (i.e., the bus). With a switched fabric, the switch makes all the resource-sharing decisions.
By centralizing the traffic-routing and resource-management functions in a single unit, PCIe also enables another important and long overdue next-generation function: quality of service (QoS). PCIe's switch can prioritize packets, so that real-time streaming packets (i.e., a video stream or an audio stream) can take priority over packets that aren't as time critical. This should mean fewer dropped frames in your first-person shooter and lower audio latency in your digital recording software.
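The idea of a switch favoring time-critical packets can be sketched with a simple priority queue. This is only an illustration: the class name, the numeric priorities, and the strict-priority policy are all assumptions made here, and a real PCIe switch uses its own traffic classification and arbitration mechanisms rather than this exact scheme.

```python
import heapq
from itertools import count


class PrioritySwitch:
    """Toy model of a switch that forwards the most urgent packet first."""

    def __init__(self):
        self._queue = []
        self._arrival = count()  # tie-breaker: keeps FIFO order within a priority

    def enqueue(self, packet: str, priority: int) -> None:
        # Convention in this sketch: a lower number means more urgent.
        heapq.heappush(self._queue, (priority, next(self._arrival), packet))

    def dequeue(self) -> str:
        _, _, packet = heapq.heappop(self._queue)
        return packet


switch = PrioritySwitch()
switch.enqueue("disk-read", priority=5)
switch.enqueue("audio-frame", priority=0)
switch.enqueue("file-copy", priority=5)
switch.enqueue("video-frame", priority=0)
print([switch.dequeue() for _ in range(4)])
```

The streaming packets leave first even though bulk traffic arrived earlier, which is the behavior the paragraph above describes.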
4.2 Traffic runs in lanes
When PCIe's designers started thinking about a true next-generation upgrade for PCI, one of the issues that they needed to tackle was pin count. In the section on PCI above, I covered some of the problems with the kind of large-scale data parallelism that PCI exhibits (i.e. noise, cost, poor frequency scaling, etc.). PCIe solves this problem by taking a serial approach.
As I noted previously, a connection between a PCIe device and a PCIe switch is called a link. Each link is composed of one or more lanes, and each lane is capable of transmitting one byte at a time in both directions at once. This full-duplex communication is possible because each lane is itself composed of one pair of signals: send and receive.
In order to transmit PCIe packets, which are composed of multiple bytes, a one-lane link must break down each packet into a series of bytes, and then transmit the bytes in rapid succession. The device on the receiving end must collect all of the bytes and then reassemble them into a complete packet. This disassembly and reassembly must happen rapidly enough that it's transparent to the next layer up in the stack. This means that it requires some processing power on each end of the link. The upside, though, is that because each lane is only one byte wide, very few pins are needed to transmit the data. You might say that this serial transmission scheme is a way of turning processing power into bandwidth; this is in contrast to the old PCI parallel approach, which turns bus width (and hence pin counts) into bandwidth. It so happens that thanks to Moore's Curves, processing power is cheaper than bus width, hence PCIe's tradeoff makes a lot of sense.
We saw earlier that a link can be composed of "one or more lanes", so let us clarify that now. One of PCIe's nicest features is the ability to aggregate multiple individual lanes together to form a single link. In other words, two lanes could be coupled together to form a single link capable of transmitting two bytes at a time, thus doubling the link bandwidth. Likewise, you could combine four lanes, or eight lanes, and so on.
A link that's composed of a single lane is called an x1 link; a link composed of two lanes is called an x2 link; a link composed of four lanes is called an x4 link, etc. PCIe supports x1, x2, x4, x8, x12, x16, and x32 link widths.
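The way a multi-lane link spreads one packet over its lanes, and the way the receiver puts it back together, can be sketched as follows. The sketch assumes bytes are dealt to the lanes round-robin and ignores encoding and framing; the function names are made up for illustration.

```python
def stripe(packet: bytes, lanes: int) -> list[bytes]:
    """Deal the packet's bytes round-robin across the lanes of a link."""
    return [packet[i::lanes] for i in range(lanes)]


def unstripe(lane_data: list[bytes]) -> bytes:
    """Reassemble the original packet from the per-lane byte streams."""
    lanes = len(lane_data)
    packet = bytearray(sum(len(data) for data in lane_data))
    for i, data in enumerate(lane_data):
        packet[i::lanes] = data  # lane i carried bytes i, i+lanes, i+2*lanes, ...
    return bytes(packet)


packet = bytes(range(10))
for width in (1, 2, 4):
    per_lane = stripe(packet, width)
    assert unstripe(per_lane) == packet
    print(f"x{width}:", [list(lane) for lane in per_lane])
```

On an x4 link each lane carries roughly a quarter of the bytes, which is where the bandwidth scaling with link width comes from.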
PCIe's bandwidth gains over PCI are considerable. A single lane is capable of transmitting 2.5Gbps in each direction, simultaneously. Add two lanes together to form an x2 link and you've got 5 Gbps, and so on with each link width. These high transfer speeds are good, good news, and will enable a new class of applications, like SLI video card rendering.
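As a quick check of how bandwidth scales with link width, the per-direction signaling rate of each supported width can be tabulated from the 2.5Gbps per-lane figure given above. This is a sketch of the raw rate only; encoding overhead is not subtracted here, and the names are illustrative.

```python
LANE_GBPS = 2.5  # raw signaling rate per lane, per direction (figure from the text)
LINK_WIDTHS = [1, 2, 4, 8, 12, 16, 32]


def link_raw_gbps(lanes: int, lane_gbps: float = LANE_GBPS) -> float:
    """Raw per-direction signaling rate of a link with the given lane count."""
    return lanes * lane_gbps


for lanes in LINK_WIDTHS:
    print(f"x{lanes}: {link_raw_gbps(lanes):g} Gbps in each direction")
```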
5 PCIe Protocol Details
Until now we have been concerned with the system-level impact of PCIe and have not looked at the protocol itself. The following material briefly explains the PCIe protocol, its layers, and the functions of each layer.
PCI Express is a high performance, general purpose I/O interconnect defined for a wide variety of future computing and communication platforms.
5.1 PCIe Link
A Link represents a dual-simplex communications channel between two components. The fundamental PCI Express Link consists of two low-voltage, differentially driven signal pairs: a Transmit pair and a Receive pair.
5.2 PCIe Fabric Topology
5.2.1 Root Complex
A Root Complex (RC) denotes the root of an I/O hierarchy that connects the CPU/memory subsystem to the I/O.
5.2.2 Endpoints
Endpoint refers to a type of Function that can be the Requester or Completer of a PCI Express transaction either on its own behalf or on behalf of a distinct non-PCI Express device (other than a PCI device or Host CPU), e.g., a PCI Express attached graphics controller or a PCI Express-USB host controller. Endpoints are classified as either legacy, PCI Express, or Root Complex Integrated Endpoints.
5.2.3 PCI Express to PCI/PCI-X Bridge
A PCI Express to PCI/PCI-X Bridge provides a connection between a PCI Express fabric and a PCI/PCI-X hierarchy.
5.3 PCI Express Layering Overview
PCI Express can be divided into three discrete logical layers: the Transaction Layer, the Data Link Layer, and the Physical Layer. Each of these layers is divided into two sections: one that processes outbound (to be transmitted) information and one that processes inbound (received) information.
PCI Express uses packets to communicate information between components. Packets are formed in the Transaction and Data Link Layers to carry the information from the transmitting component to the receiving component. As the transmitted packets flow through the other layers, they are extended with additional information necessary to handle packets at those layers. At the receiving side the reverse process occurs and packets get transformed from their Physical Layer representation to the Data Link Layer representation and finally (for Transaction Layer Packets) to the form that can be processed by the Transaction Layer of the receiving device. The figure below shows the conceptual flow of transaction level packet information through the layers.
Note that a simpler form of packet communication is supported between two Data Link Layers (connected to the same Link) for the purpose of Link management.
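The wrapping and unwrapping of a packet as it passes through the layers can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the 2-byte sequence field, the 4-byte check value, the one-byte start and end markers, and the use of `zlib.crc32` as a stand-in for the link CRC. The function names are hypothetical as well.

```python
import struct
import zlib

START, END = b"\xfb", b"\xfd"  # placeholder framing markers for this sketch


def data_link_wrap(tlp: bytes, seq: int) -> bytes:
    """Data Link Layer (transmit): prepend a sequence number, append a CRC."""
    body = struct.pack(">H", seq) + tlp
    return body + struct.pack(">I", zlib.crc32(body))


def physical_wrap(packet: bytes) -> bytes:
    """Physical Layer (transmit): add framing around the packet."""
    return START + packet + END


def physical_unwrap(frame: bytes) -> bytes:
    """Physical Layer (receive): check and strip the framing."""
    if frame[:1] != START or frame[-1:] != END:
        raise ValueError("bad framing")
    return frame[1:-1]


def data_link_unwrap(packet: bytes) -> tuple[int, bytes]:
    """Data Link Layer (receive): verify the CRC, return (sequence, TLP)."""
    body, crc = packet[:-4], packet[-4:]
    if struct.pack(">I", zlib.crc32(body)) != crc:
        raise ValueError("CRC mismatch")
    (seq,) = struct.unpack(">H", body[:2])
    return seq, body[2:]


tlp = b"\x40\x00\x00\x01hello"  # opaque bytes standing in for a real TLP
frame = physical_wrap(data_link_wrap(tlp, seq=5))
assert data_link_unwrap(physical_unwrap(frame)) == (5, tlp)
```

Each layer only adds or removes its own fields, so the Transaction Layer on the receiving side sees exactly the bytes the sender's Transaction Layer produced.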
5.4 Layers of the Protocol
We take a brief look at the functions of each of the three layers.
5.4.1 Transaction Layer
This is the top layer that interacts with the software above.
Functions of Transaction Layer:
1. Mechanisms for differentiating the ordering and processing requirements of Transaction Layer Packets (TLPs)
2. Credit-based flow control
3. TLP construction and processing
4. Association of transaction-level mechanisms with device resources including Flow Control and Virtual Channel management
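Credit-based flow control, listed above, can be illustrated with a toy model: the transmitter may only send while it holds credits, and the receiver hands credits back as it frees buffer space. This is a simplified sketch with a single credit pool and made-up names; the real protocol accounts for credits in a more fine-grained way.

```python
class CreditedLink:
    """Toy credit-based flow control between one sender and one receiver."""

    def __init__(self, receiver_buffer_slots: int):
        self.credits = receiver_buffer_slots  # credits advertised by the receiver
        self.receiver_buffer: list[str] = []

    def try_send(self, tlp: str) -> bool:
        """Send only if a credit is available; never overrun the receiver."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.receiver_buffer.append(tlp)
        return True

    def receiver_drain(self, n: int = 1) -> list[str]:
        """Receiver processes up to n packets and returns that many credits."""
        drained = self.receiver_buffer[:n]
        del self.receiver_buffer[:n]
        self.credits += len(drained)
        return drained


link = CreditedLink(receiver_buffer_slots=2)
print(link.try_send("A"), link.try_send("B"), link.try_send("C"))  # third send is refused
link.receiver_drain()
print(link.try_send("C"))  # accepted once a credit has been returned
```

Because the sender checks credits before transmitting, packets are never dropped for lack of buffer space; they simply wait at the source.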
5.4.2 Data Link Layer
The Data Link Layer acts as an intermediate stage between the Transaction Layer and the Physical Layer. Its primary responsibility is to provide a reliable mechanism for exchanging Transaction Layer Packets (TLPs) between the two components on a Link.
Functions of Data Link Layer:
1. Data Exchange:
2. Error Detection and Retry:
3. Initialization and power management:
5.4.3 Physical Layer
The Physical Layer isolates the Transaction and Data Link Layers from the signaling technology used for Link data interchange. The Physical Layer is divided into the logical and electrical sub-blocks.
The logical sub-block takes care of symbol encoding, framing, data scrambling, Link initialization and training, and Lane-to-Lane de-skew.
The electrical sub-block section defines the physical layer of PCI Express 5.0 GT/s that consists of a reference clock source, Transmitter, channel, and Receiver. This section defines the electrical-layer parameters required to guarantee interoperability among the above-listed PCI Express components. This section comprehends both 2.5 GT/s and 5.0 GT/s electricals. In many cases the parameter definitions between 2.5 and 5.0 GT/s are identical, even though their respective values may differ. However, the need at 5.0 GT/s to minimize guardbanding, while simultaneously comprehending all phenomena affecting signal integrity, requires that all the PCI Express system components - Transmitter, Receiver, channel, and Refclk, be explicitly defined in the specification. For this reason, each of these four components has a separate specification section for 5.0 GT/s.
6 Changes in PCIe 3.0 (GEN3)
The goal of the PCI-SIG work group defining this next-generation interface was to double the bandwidth of PCIe Gen 2, which signals at 5 gigatransfers per second (GT/s) but delivers 4Gb/s of effective bandwidth after 8b/10b encoding overhead. The group had two choices: either increase the signaling rate to 10GT/s and keep the 20 percent encoding overhead, or select a lower signaling rate (8GT/s) for better signal integrity and reduce the encoding overhead, which brings a different set of challenges. The PCI-SIG decided to go with 8GT/s and reduce the encoding overhead to offer approximately 7.88Gb/s of effective bandwidth, approximately double that of PCIe Gen 2. The increase in signaling rate from PCIe Gen 2's 5GT/s to PCIe Gen 3's 8GT/s provides a sixty percent increase in data rate, and the remainder of the effective bandwidth increase comes from replacing the 8b/10b encoding (20 percent inefficiency) with 128b/130b coding (1-2 percent inefficiency).
The challenge of moving from PCIe Gen 2 to Gen 3 is to accommodate the signaling rate where clock timing goes from 200ps to 125ps, jitter tolerance goes from 44ps to 14ps, and the total sharable band (for SSC) goes down from 80ps to 35ps. These are enormous challenges to overcome, but the PCI-SIG has already completed board, package, and system modeling to make sure designers are able to develop systems that support these rates.
The table below highlights some key aspects of PCIe Gen 2 and Gen 3. The beauty of the Gen 3 solution is that it will support twice the data rate with equal or lower power consumption than PCIe Gen 2. Additionally, applications using PCIe Gen 2 would be able to migrate seamlessly, as the reference clock remains at 100MHz and the channel reach for mobiles (8 inches), clients (14 inches), and volume servers (20 inches) stays the same. More complex equalizers, such as decision feedback equalization, may be implemented optionally for the extended reach needed in a backplane environment.
The Gen 3 specification will enhance signaling by adding transmitter de-emphasis, receiver equalization, and optimization of Tx/Rx Phase Lock Loops and Clock Data Recovery. The Gen 3 specification also requires devices that support Gen 3 rate to dynamically negotiate up or down to/from Gen 1 and Gen 2 data rates based on signal/line conditions.
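The bandwidth arithmetic in this section can be checked with a short calculation. The sketch below takes the signaling rates and encoding schemes from the text, computes the effective per-lane rate after encoding overhead, and derives the bit time (unit interval) as the reciprocal of the signaling rate. The function and table names are made up for illustration.

```python
from fractions import Fraction

# Signaling rate in GT/s and encoding efficiency (payload bits / transmitted bits).
GENERATIONS = {
    "Gen 2": (5, Fraction(8, 10)),      # 8b/10b
    "Gen 3": (8, Fraction(128, 130)),   # 128b/130b
}


def effective_gbps(rate_gt_s: int, efficiency: Fraction) -> float:
    """Per-lane, per-direction data rate left after encoding overhead."""
    return float(rate_gt_s * efficiency)


def unit_interval_ps(rate_gt_s: int) -> float:
    """Duration of one transferred bit, in picoseconds."""
    return 1000 / rate_gt_s


for name, (rate, efficiency) in GENERATIONS.items():
    print(f"{name}: {rate} GT/s, "
          f"{effective_gbps(rate, efficiency):.2f} Gb/s effective, "
          f"{unit_interval_ps(rate):.0f} ps per bit")
```

The ratio of the two effective rates comes out just under 2, which is why an 8GT/s signaling rate is enough to roughly double Gen 2 bandwidth.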
6.1 Benefits from the newer specs
6.1.1 Higher Speed
Goal: improve performance. Each successive generation doubles the bit rate of the previous generation, and that holds true for Gen3, too, but there's a significant difference this time. Since the previous speed was 5.0 GT/s, the new speed would normally have been 10.0 GT/s, but the spec writers considered a 10.0 GT/s signal problematic because of the board design and signal integrity issues that many vendors would face. Constrained to stay under that rate, they were forced to consider other options. The solution they chose was to move away from the 8b/10b encoding scheme that PCIe and most other serial transports have used, because it adds a 20% overhead from the receiver's perspective. Instead, they chose a modified scrambling method that effectively creates a 128b/130b encoding method. This gain in efficiency meant that an increase in signaling rate to only 8.0 GT/s would be enough to achieve a doubling of the bandwidth and meet this goal.
6.1.2 Resizable BAR Capability
Goal: allow the system to select how much system resource is allocated to a device. This new optional set of registers allows functions to communicate their resource size options to system software, which can then select the optimal size and communicate that back to the function. Ideally, the software would use the largest setting reported, since that would give the best performance, but it might choose a smaller size to accommodate constrained resources. Currently, sizes from 1MB to 512GB are possible. If these registers are implemented, there is one capability register to report the possible sizes, and one control register to select the desired size for each BAR. Note that devices might report a smaller size by default to help them be compatible in many systems, but using the smaller size would also reduce their performance.
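The selection step described above (the function reports the sizes it supports, software picks the largest one that fits) might look like the following sketch. The bitmask layout is a simplification invented for this example, where bit n set means a size of 2^n MB is supported; it is not the actual register layout, and the function names are hypothetical.

```python
from typing import Optional

MB = 1 << 20
MAX_EXPONENT = 19  # 2**19 MB = 512GB, the largest size mentioned in the text


def supported_sizes(size_mask: int) -> list[int]:
    """Decode the simplified mask: bit n set means 2**n MB is supported."""
    return [MB << n for n in range(MAX_EXPONENT + 1) if size_mask & (1 << n)]


def choose_bar_size(size_mask: int, available_bytes: int) -> Optional[int]:
    """Pick the largest supported size that fits in the available address space."""
    fitting = [size for size in supported_sizes(size_mask) if size <= available_bytes]
    return max(fitting, default=None)


# A hypothetical function that supports 1MB, 256MB, and 4GB apertures.
mask = (1 << 0) | (1 << 8) | (1 << 12)
print(choose_bar_size(mask, available_bytes=1 << 30) // MB, "MB")  # 1GB free: picks 256MB
```

With only 1GB of address space free, software settles for the 256MB option even though the function could expose 4GB, which is the trade-off against performance noted above.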
6.1.3 Dynamic Power Allocation
Goal: provide more software-controlled power states to improve power management (PM). Some endpoint devices don't have a device-specific driver to manage their power efficiently, and DPA provides a means to fill that gap. DPA only applies when the device is in the D0 state, and it defines up to 32 substates. Substate 0 (the default) defines the maximum power, and each successive substate has a power allocation equal to or lower than the previous one. Software is permitted to change the substates in any order.
The Substate Control Enabled bit can be used to disable this capability. Any time the device is changing between substates, it must always report the higher power requirement of the two until the transition has been completed, and the time needed to make the change is implementation specific.
To allow software to set up PM policies, functions define two transition latency values, and every substate associates its max transition time (the longest time it takes to enter that substate from any other substate) with one of those.
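The substate rules above can be captured in a small model: allocations never increase from one substate to the next, software may jump to any substate, and during a transition the function reports the higher of the two allocations. The class, its method names, and the wattage values are all hypothetical; this is a sketch of the rules as described, not of the register interface.

```python
class DpaFunction:
    """Toy model of a function with Dynamic Power Allocation substates."""

    MAX_SUBSTATES = 32

    def __init__(self, allocations_watts: list[float]):
        if not 1 <= len(allocations_watts) <= self.MAX_SUBSTATES:
            raise ValueError("between 1 and 32 substates are allowed")
        pairs = zip(allocations_watts, allocations_watts[1:])
        if any(later > earlier for earlier, later in pairs):
            raise ValueError("each substate must allocate no more than the previous one")
        self.allocations = allocations_watts
        self.current = 0      # substate 0 is the default (maximum power)
        self.target = None    # set while a transition is in progress

    def request(self, substate: int) -> None:
        """Software may request any substate, in any order."""
        if not 0 <= substate < len(self.allocations):
            raise ValueError("no such substate")
        self.target = substate

    def reported_power(self) -> float:
        """During a transition, report the higher of the two allocations."""
        if self.target is None:
            return self.allocations[self.current]
        return max(self.allocations[self.current], self.allocations[self.target])

    def complete_transition(self) -> None:
        if self.target is not None:
            self.current, self.target = self.target, None


func = DpaFunction([25.0, 15.0, 15.0, 5.0])
func.request(3)
print(func.reported_power())   # still the higher value while moving from substate 0 to 3
func.complete_transition()
print(func.reported_power())   # the lower allocation once the transition is done
```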
6.1.4 Alternative Routing-ID Interpretation
Goal: support a much larger number of functions inside devices. For requesters and completers, this means treating the device number value as though it were really just an extension of the function field, giving an 8-bit value for the function number. Since the device number is no longer included, it's always assumed to be 0.
The spec also defines a new set of optional registers that can be used to assign a function group number to each function. Within an ARI device, several functions can be associated with a single group number, and that can serve as the basis for arbitration or access control.
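The difference between the two interpretations of a routing ID can be shown by decoding the same 16-bit value both ways. The sketch assumes the conventional layout of an 8-bit bus number followed by a 5-bit device number and a 3-bit function number, with ARI merging the last two fields into one 8-bit function number as described above. The function name is made up for illustration.

```python
def decode_routing_id(routing_id: int, ari: bool = False) -> dict:
    """Split a 16-bit routing ID into bus, device, and function numbers."""
    bus = (routing_id >> 8) & 0xFF
    if ari:
        # ARI: the low 8 bits are all function number; device is implied to be 0.
        return {"bus": bus, "device": 0, "function": routing_id & 0xFF}
    return {
        "bus": bus,
        "device": (routing_id >> 3) & 0x1F,  # 5-bit device number
        "function": routing_id & 0x07,       # 3-bit function number
    }


rid = 0x03AD
print(decode_routing_id(rid))            # traditional: bus 3, device 21, function 5
print(decode_routing_id(rid, ari=True))  # ARI: bus 3, device 0, function 173
```

With ARI a single device can therefore address up to 256 functions instead of 8.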