PCI Express in Intel V-Series FPGAs: Interface Basics and Hardware Core Features

Introduction


The PCI Express (PCIe) interface, familiar to many, became available to FPGA system developers back when it was only beginning to spread through digital technology. At that time, the available solution connected a soft core to an external physical-layer chip [5]. This made it possible to implement a single PCIe lane at 2.5 gigatransfers per second. Later, as the technology developed, the physical layer of the interface migrated into hardware PCIe blocks inside the FPGAs themselves; the number of available lanes grew to 8, and in a number of newer devices to 16; data transfer rates have grown in step with modern revisions of the standard.

At the same time, it is still difficult to find supporting material on working with the hardware cores of modern FPGAs in Russian-language sources, and not much information is available on the PCIe interface itself. The documentation for the hardware PCI Express cores assumes that the developer is already familiar with the standard and understands the basics of data transfer between a device and a personal computer (PC). However, the sheer volume of information in the PCIe standard itself does not immediately make clear what steps must be taken to successfully move data from the device to PC memory or back. To get a more complete picture, a considerable part of the information has to be gathered bit by bit from various sources. For developers working with Intel FPGAs there is an additional difficulty: most of the available materials and articles describe working with Xilinx FPGA hardware cores.

In this article, the author will try to cover what an FPGA system designer needs to know to work with the PCI Express interface, and will consider the features of the hardware PCI Express cores of Intel's V-series FPGAs in the Avalon-ST variant.

PCIe Levels and Packet Types


Although PCI Express is often called a bus, this interface is in fact a network of devices connected by groups of serial duplex links. A PCIe network consists of several main node types: the root (Root Complex), the endpoint (Endpoint), and the switch (Switch) (Figure 1). To transfer data between just two devices, a root and an endpoint are sufficient. In modern PCs, the network root resides on the same die as the CPU cores. Wherever the PCIe root is located, it is connected to system memory.


Figure 1 - PCIe Network

The PCIe data transfer protocol is divided into three layers: the Transaction Layer, the Data Link Layer, and the Physical Layer. Interface data is transmitted in the form of packets. A generalized view of packets is shown in Figure 2.


Figure 2 - A generalized view of PCIe packets

At the transaction layer, any packet (Transaction Layer Packet, TLP) consists of at least a header. Depending on the packet type, the header may be followed by data - the packet payload. An additional checksum may also be appended at the end of the packet. The main types of transaction-layer packets are listed in Table 1:

Table 1 - Types of transaction-layer packets

  1. Memory Read Request
  2. Memory Write Request
  3. I/O Read Request
  4. I/O Write Request
  5. Configuration Read Request
  6. Configuration Write Request
  7. Completion
  8. Message

At the data link layer, a sequence number and a link CRC (LCRC) are added to each transaction-layer packet. The data link layer also forms its own packet types (Data Link Layer Packets, DLLPs), which include (Table 2):

Table 2 - Data link layer packet types

  1. Transaction-layer packet acknowledgement (Ack)
  2. Transaction-layer packet rejection (Nak)
  3. Power management
  4. Flow control

Finally, the physical layer frames the packets with start and end symbols, which are borrowed from the IEEE 802.3 standard. For transaction-layer packets, the symbols K27.7 and K29.7 are used, respectively; for data link layer packets, the symbols K28.2 and K29.7.
When working with the FPGA hardware cores, the developer needs to form only transaction-layer packets; data link and physical layer packets are formed by the core's own blocks.

Transaction-Level Packet Routing


In total, packets can travel from sender to receiver in three ways:

  • address routing;
  • ID routing;
  • implicit routing.

The relationship between the routing method and the type of transaction level packet is presented in Table 3.

Table 3 - Correspondence of the routing method and packet type
  1. Memory and I/O read/write requests - address routing
  2. Configuration requests and completions - ID routing
  3. Messages - implicit routing (some message types may also use address or ID routing)


Each endpoint has its own configuration space (Configuration Space), which holds various control and status registers. Among them are the Base Address Registers (BARs). When endpoints are initialized, the BIOS or the operating system scans the endpoint BARs to determine how much memory or I/O space each endpoint requires. Then the starting address of the allocated region of system memory is written into each active BAR. As a result, the endpoint acquires an address to which the corresponding requests can be sent. Usually a register map is formed at the endpoint and tied to the allocated memory regions.
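The BAR scan mentioned above follows a well-known procedure: the system writes all ones to a BAR, reads the value back, and recovers the region size from the bits that remained writable. A minimal sketch in Python (the function name is illustrative):

```python
def bar_size(readback: int, is_io: bool = False) -> int:
    """Decode the size of a 32-bit BAR region from the value read
    back after the system writes all ones to the register."""
    # The low bits are flags: bit 0 selects memory (0) or I/O (1);
    # memory BARs also use bits [3:1] for type and prefetch flags.
    mask = 0x3 if is_io else 0xF
    addr_bits = readback & ~mask & 0xFFFFFFFF
    # Invert within 32 bits and add one to recover the region size.
    return (~addr_bits & 0xFFFFFFFF) + 1

# A memory BAR implementing a 1 MiB region returns 0xFFF00000
# (ignoring the flag bits) after the all-ones write:
print(hex(bar_size(0xFFF00000)))  # 0x100000
```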

Also, each endpoint - or rather, each logical device inside it - receives a unique identifier consisting of three parts: the bus number, the device number, and the logical device (function) number.

The system thus has enough information to communicate with the endpoint. However, transferring data with requests to BAR-mapped registers performs poorly. First, for a 32-bit BAR the usable request length is limited to one double word (DWORD); for a 64-bit BAR, to two double words. Second, every request involves the central processor. To reduce the load on the processor and to increase the size of each packet, the endpoint must move data to or from system memory on its own. To do this, the endpoint has to know at which system memory addresses it may write or read data.

Given the above, the general data transfer scheme between the endpoint and the system memory can be represented as follows:

  1. the endpoint driver allocates buffers in system memory for the data;
  2. the driver forms in system memory a set of buffer addresses and sizes - the buffer descriptors;
  3. the driver writes the address of the descriptor set to device registers associated with the BAR regions;
  4. the driver programs the transfer control registers associated with the BAR regions;
  5. the endpoint sends a system memory read request to fetch the set of descriptors;
  6. the endpoint sends write requests to system memory and fills the described buffers;
  7. having filled the buffers, the endpoint notifies the driver, for example, with an interrupt;
  8. the driver processes the received data and, if necessary, restarts the exchange over PCIe.
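As an illustration of step 2, a buffer descriptor can be thought of as a system-memory address plus a size packed into a fixed layout. The layout below is purely illustrative - the real format is defined by the developer's own DMA engine:

```python
import struct

def pack_descriptor(addr: int, size: int) -> bytes:
    """Pack one buffer descriptor: a 64-bit system-memory address
    followed by a 32-bit size in bytes, little-endian.
    This layout is an assumption for the sketch, not a core format."""
    return struct.pack("<QI", addr, size)

# A descriptor table for two 4 KiB buffers allocated by the driver:
table = b"".join(pack_descriptor(a, 0x1000)
                 for a in (0x1_2345_0000, 0x1_2345_1000))
print(len(table))  # 24
```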


At the stage when the driver configures the endpoint registers, depending on the type of address space associated with the BAR, the endpoint receives either a memory write request (Figure 3) or an I/O space write request. If the driver reads a register during configuration, the endpoint likewise receives the corresponding read requests (Figure 4).

Figure 3 - Example of a 1 DW memory write request


Figure 4 - Example of a 1 DW memory read request

Unlike memory write and read requests, I/O requests have a number of limitations. First, both write and read I/O requests require a completion from the recipient. As a result, the data transfer rate achievable with I/O space requests is far below the theoretical PCIe bandwidth. Second, the address of an I/O request is limited to 32 bits, which prevents access to system memory above 4 GB. Third, I/O requests cannot exceed one double word and cannot use multiple virtual channels for transport. For these reasons, I/O space write and read requests are not considered further. Nonetheless, the headers for memory and I/O accesses differ only in a few fields, so the packet structures shown in Figures 3 and 4 also apply to I/O space requests.

When an endpoint or the PCIe root receives a memory or I/O read request, the device must send a completion. If the requester does not receive a completion within a certain time, a completion timeout error occurs. If for some reason the device cannot return the requested data, it must generate an error completion. Possible reasons include: the recipient does not support the request (Unsupported Request); the recipient is not ready to accept a configuration request and asks for it to be repeated later (Configuration Request Retry Status); or an internal error has occurred because of which the recipient rejects the request (Completer Abort).
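The error causes listed above correspond to values of the Completion Status field in the completion header. A small lookup, using the encodings from the PCIe base specification:

```python
# Completion Status field values as defined in the PCIe Base
# Specification; all other encodings are reserved.
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}

def decode_status(code: int) -> str:
    """Return the human-readable name of a completion status code."""
    return COMPLETION_STATUS.get(code, "Reserved")

print(decode_status(0b001))  # Unsupported Request (UR)
```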

Formats of a successful completion for a read request and of an error completion for an unsupported request are shown in Figures 5 and 6.


Figure 5 - Example of a successful read completion


Figure 6 - Example of a completion for an unsupported request

As long as the endpoint accesses a memory region within the first 4 GB, the packet header format is the same as in Figures 3 and 4. For write or read requests to memory above 4 GB, the header carries an additional double word with the high-order bits of the destination address (Figure 7).
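The choice between the two header formats can be expressed as a simple rule (the function name is illustrative):

```python
def memory_request_header_dws(address: int) -> int:
    """Return the memory-request header length in double words:
    3 DW with a 32-bit address, 4 DW when the target lies above
    4 GB. Per the specification, the 4 DW format must not be used
    for addresses that fit in 32 bits."""
    return 4 if address >= 1 << 32 else 3

print(memory_request_header_dws(0x0000_F000))    # 3
print(memory_request_header_dws(0x1_0000_0000))  # 4
```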


Figure 7 - Example of a 128-bit (4 DW) write request header

Explanations of abbreviated names of packet header fields are presented in Table 4.

Table 4 - List of abbreviations for header fields
  1. TC - Traffic Class: defines the packet's virtual channel membership
  2. Attr - Attributes: relaxed ordering, no-snoop, and ID-based ordering flags
  3. TH - TLP Processing Hint: indicates that the PH[1..0] field contains a valid hint
  4. TD - TLP Digest: indicates that a checksum (ECRC) is appended to the packet
  5. EP - Error Poisoned: indicates that the packet data is corrupted
  6. AT - Address Translation: the address type: untranslated, translation request, or translated
  7. BE - Byte Enable: marks the valid bytes in the first and last double words of the payload
  8. PH - Processing Hint: hints to the recipient how the data will be used, and its structure
  9. BCM - Byte Count Modified: indicates that the byte count has changed; only a sender that is a PCI-X device may set this flag

If an endpoint uses interrupts to report events, it must also form an appropriate packet. PCIe supports three types of interrupts:

  • legacy interrupts (Legacy Interrupts, INT);
  • message-signaled interrupts (MSI);
  • extended message-signaled interrupts (MSI-X).

Legacy INT interrupts are kept for compatibility with systems that do not support message-signaled interrupts. In essence, this type of interrupt is a message (a packet of the Message type) that emulates the operation of a physical interrupt line. On a given event, the endpoint sends the PCIe root a message that the INT interrupt has been asserted, and then waits for the interrupt handler to act. Until the handler performs the required action, the INT interrupt remains asserted. Legacy interrupts do not identify the source of the event, which forces the handler to scan all endpoints in the PCIe tree sequentially to service the interrupt. Once the interrupt has been serviced, the endpoint sends a message that the INT interrupt is no longer active. The FPGA hardware cores generate the necessary INT messages on their own upon a signal from the user logic, so the packet structure is not considered here.

Message-signaled interrupts, together with their extended version, are the primary and mandatory interrupt type in PCIe. Both kinds are, in essence, a one-double-word write request to system memory. The difference from an ordinary write is that the write address and the packet contents are allocated to each device at system configuration time. The destination is the Local Advanced Programmable Interrupt Controller (LAPIC) inside the central processor. With this interrupt type there is no need to poll all devices in the PCIe tree sequentially. Moreover, if the system lets the device use several interrupt vectors, each vector can be associated with its own event. Together, this reduces the processor time spent handling interrupts and increases overall system performance.

MSI interrupts allow up to 32 separate vectors. The exact number depends on the capabilities of the endpoint, and the system may permit only a subset of the vectors. At configuration time, the system writes the interrupt address and the base message data into special registers of the endpoint's configuration space. All active vectors use the same address, but for each vector the endpoint modifies the low-order bits of the base data. For example, suppose an endpoint supports at most 4 interrupt vectors, all 4 are enabled by the system, and the base message data is 0x4970. To raise the first vector, the endpoint sends the base data unchanged. For the second vector, the device modifies the low bits and sends 0x4971; for the third and fourth vectors it sends 0x4972 and 0x4973, respectively.
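The vector encoding from this example can be sketched as follows (the function name is illustrative):

```python
def msi_message_data(base_data: int, vector: int, allocated: int) -> int:
    """Encode the MSI Message Data for a given vector: the endpoint
    replaces the low log2(allocated) bits of the base value with the
    vector number. MSI allocates vectors in power-of-two counts."""
    if allocated & (allocated - 1) or not 1 <= allocated <= 32:
        raise ValueError("MSI allocates a power-of-two vector count up to 32")
    if not 0 <= vector < allocated:
        raise ValueError("vector outside the allocated range")
    low_bits = allocated.bit_length() - 1   # log2(allocated)
    mask = (1 << low_bits) - 1
    return (base_data & ~mask) | vector

# The example from the text: base data 0x4970, four vectors allowed.
print([hex(msi_message_data(0x4970, v, 4)) for v in range(4)])
# ['0x4970', '0x4971', '0x4972', '0x4973']
```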

The FPGA hardware cores form an MSI interrupt packet on their own upon a signal from the user logic. However, before commanding the core to send an interrupt, the user logic must supply the packet contents for the required vector through a special core interface.

MSI-X interrupts allow up to 2048 individual vectors. In the corresponding registers of the configuration space, the endpoint indicates in which of the BAR address spaces, and at what offset from the base address, the interrupt table (Figure 8) and the table of pending interrupt flags (Pending Bit Array, PBA; Figure 9) are located, as well as the sizes of both tables. The system writes a separate address and data word into each row of the interrupt table, and enables or masks an individual vector through bit 0 of the Vector Control field. On a given event, the endpoint sets a flag in the pending interrupt table. If the vector is not masked in the Vector Control field, the endpoint sends an interrupt to the address from the interrupt table with the specified packet contents.
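Each row of the MSI-X table occupies 16 bytes: Message Address (low and high), Message Data, and Vector Control, whose bit 0 is the per-vector mask. A sketch of computing the field offsets of one row (names are illustrative):

```python
MSIX_ENTRY_SIZE = 16  # bytes per MSI-X table entry

def msix_entry_offsets(table_offset: int, vector: int) -> dict:
    """Byte offsets of one vector's fields relative to the BAR base,
    given the table offset reported in the MSI-X capability."""
    base = table_offset + vector * MSIX_ENTRY_SIZE
    return {
        "msg_addr_lo": base + 0,
        "msg_addr_hi": base + 4,
        "msg_data":    base + 8,
        "vector_ctrl": base + 12,
    }

# Fields of vector 3 in a table placed at offset 0x2000 of its BAR:
print({k: hex(v) for k, v in msix_entry_offsets(0x2000, 3).items()})
```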


Figure 8 - Table of MSI-X interrupt vectors


Figure 9 - Table of flags for pending interrupts

The FPGA hardware cores have no specialized interface for MSI-X interrupts. The developer must implement the interrupt table and the pending interrupt flag table in user logic. The interrupt packet is also formed entirely by the user and transmitted through the general core interface along with other packet types. The packet format, as already mentioned, corresponds to a one-double-word write request to system memory.

Features of the Intel V-Series FPGA PCI Express Hardware Cores in the Avalon-ST Variant


Although the hardware PCI Express cores of FPGAs from different manufacturers implement similar functionality, individual core interfaces and their order of operation may differ.
The Intel V-series FPGA PCI Express hardware cores come in two variants: with an Avalon-MM interface and with an Avalon-ST interface. The latter, although it demands more effort from the developer, delivers the highest bandwidth. For this reason, the core with the Avalon-MM interface is not considered here.

The documentation for the PCI Express core with the Avalon-ST interface describes the core parameters and the input and output signals in sufficient detail. Nevertheless, the core has a number of features the developer should pay attention to.

The first group of features concerns the methods that allow the FPGA to be configured within 100 ms, as PCIe requires. In addition to fast passive parallel (FPP) configuration, the developer is offered Configuration via Protocol (CvP) and the core's autonomous mode. The developer must make sure that CvP or autonomous mode is supported for the selected PCIe speed (the "Lane Rate" parameter). For CvP, the relevant information can be found in the core documentation. For autonomous mode there is no such information, so the project has to be compiled: if autonomous mode is not supported at the current core speed, Quartus reports a corresponding error (Figure 10).


Figure 10 - Compilation error for a PCIe core in autonomous mode

If the developer plans to use Configuration via Protocol, he should also check which hardware PCIe block of the FPGA the PCIe connector is wired to. This is especially important when the developer uses a custom device rather than an off-the-shelf board. In FPGAs with several PCIe hardware cores, only one of them supports CvP; its location is indicated in the FPGA documentation.
The second group of features concerns the Avalon-ST data transfer interface itself. It is this interface that carries transaction-layer packets between the user logic and the core.

On the receive side, the core has two signals that let the user pause the delivery of received packets: rx_st_mask and rx_st_ready.
With the rx_st_ready signal, the developer can pause the output of all packet types. However, after the signal changes state, the core stops packet output only two clock cycles later. The user logic must therefore be ready to accept an additional amount of data. If, for example, the developer buffers data in a FIFO, he must avoid overflowing it; otherwise part of the packet contents will be lost.

With the rx_st_mask signal, the developer suspends the delivery of requests that require completions. This signal also does not stop packet output immediately: according to the documentation, the core may still deliver up to 10 requests after the signal is asserted. If the user logic asserts rx_st_mask and, lacking buffer space for the received packets, also deasserts rx_st_ready, it stops reading any packets from the core's internal buffer. This not only overflows the hardware core's buffers but also violates the packet ordering requirements: requests that do not require completions, as well as read completions, must be allowed to pass ahead. Otherwise the data channel becomes firmly blocked. For this reason, the developer should use an additional buffer for requests that require completions and not let the logic block higher-priority packets.

On the transmit side, the tx_st_valid and tx_st_ready signals can cause problems. While tx_st_ready is active, the user logic is forbidden to deassert tx_st_valid in the middle of an outgoing packet. This means that during the transfer, the user logic must supply the entire contents of the packet. If the data source is slower than the core interface, the user logic must accumulate the required amount of data before starting the packet.
On both the receive and transmit sides, the developer should pay attention to the byte order in the packet header and payload, as well as to data alignment.

In the Avalon-ST packet of the hardware core, within each double word of the PCIe packet header, the bytes follow from least significant to most significant; within the packet payload, from most significant to least significant. The developer must use the same order in outgoing packets to transfer data successfully from the endpoint to the root.

The Avalon-ST interface of the hardware core aligns data on 64-bit boundaries. Depending on the width of the Avalon-ST interface, the length of the transaction-layer packet header, and the packet address, the core may insert an empty double word between the packet header and its payload. In turn, when transmitting data, the user logic must insert the empty double word itself, by analogy with the core. This empty double word is not counted in the packet length and is needed only for correct operation of the hardware core.
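One way to reason about when the empty double word appears is the following sketch. It assumes the payload must land at an even double-word position whenever the destination address is qword-aligned (address bit 2 clear) - this is only an illustrative reading of the rule, and the exact behavior for the chosen interface width must be checked against the core documentation:

```python
def needs_padding_dw(header_dws: int, address: int) -> bool:
    """Sketch of the 64-bit alignment rule, under the assumption
    that the first payload DW must occupy an even DW slot when the
    destination address is qword-aligned. With a 3 DW header the
    payload would otherwise start at an odd slot, so an empty DW is
    inserted; a 4 DW header behaves the other way around."""
    addr_dw_odd = bool(address & 0x4)        # address bit 2
    payload_start_odd = (header_dws % 2) == 1
    return payload_start_odd != addr_dw_odd

print(needs_padding_dw(3, 0x0000_1000))  # True
print(needs_padding_dw(4, 0x0000_1000))  # False
```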

The next feature concerns incoming read completions. The core description states that the core does not pass through incoming completions whose identifier does not match an outgoing request. At the same time, the user logic must track the completion timeout: when it expires, the user logic must raise the cpl_err[0] or cpl_err[1] flag. The documentation does not make clear how the filtering behaves when the endpoint has several read requests in flight. The user logic only tells the core that the timeout has expired for one of the requests, but cannot pass that request's identifier to the core. It is therefore possible that the core will deliver to the user side completions for a request whose timeout has already expired. Consequently, the developer has to build his own filter for incoming completions.

Finally, developers are strongly encouraged to use the information about the available credits for outgoing packets. The core documentation says this is optional, since the core checks the credits itself and blocks packets when credits are insufficient. However, all packet types enter the core through a single interface. If the core's packet buffer overflows, the core drives tx_st_ready to zero, and until tx_st_ready returns to one, the user logic cannot send any packets at all. The number of available credits is updated by packets from the link partner. If the user logic not only writes frequently but also reads, the rate at which the core updates its credit counters drops, and in the end overall system performance suffers.
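A transmit-side credit check can be modeled roughly as follows. Real designs track six credit categories separately (posted, non-posted, and completion, each with header and data credits), so this single counter is only a sketch:

```python
class CreditCounter:
    """Minimal model of tracking one category of flow-control
    credits on the transmit side."""

    def __init__(self, limit: int):
        self.limit = limit    # credits advertised by the link partner
        self.consumed = 0     # credits consumed by sent packets

    def can_send(self, cost: int = 1) -> bool:
        return self.consumed + cost <= self.limit

    def send(self, cost: int = 1) -> None:
        if not self.can_send(cost):
            raise RuntimeError("not enough credits; wait for a flow-control update")
        self.consumed += cost

    def update(self, new_limit: int) -> None:
        # Flow-control update DLLPs from the partner raise the limit.
        self.limit = new_limit

fc = CreditCounter(limit=8)
for _ in range(8):
    fc.send()
print(fc.can_send())  # False
```

Checking `can_send()` before presenting a packet on the Avalon-ST interface avoids the situation described above, where a full core buffer deasserts tx_st_ready and blocks every packet type at once.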

Conclusion


The article has described the general principles of data transfer over PCI Express and the formats of the main data packets. Nevertheless, the author has left out such parts of the interface as virtual channels, accounting for the buffer space for incoming read completions, and the relaxed packet ordering rules. These topics are discussed in detail in a number of foreign sources [4, 6].
The article has also covered the features of the Intel V-series FPGA PCI Express hardware cores that the author encountered while working on an interface controller. This experience may prove useful to other developers.

List of sources used


  1. A PCIe DMA Architecture for Multi-Gigabyte Per Second Data Transmission / L. Rota, M. Caselle, et al. // IEEE Transactions on Nuclear Science, Vol. 62, No. 3, June 2015.
  2. An Efficient and Flexible Host-FPGA PCIe Communication Library / Jian Gong, Tao Wang, Jiahua Chen et al. // 2014 24th International Conference on Field Programmable Logic and Applications.
  3. Design and Implementation of a High-Speed Data Acquisition Card Based on PCIe Bus / Li Mu-guo, Huang Ying, Liu Yu-zhi // Measurement & Control Technology, 2013, Vol. 32, No. 7.
  4. Down to the TLP: How PCI express devices talk (Part I) / Eli Billauer
  5. Low-Cost FPGA Solution for PCI Express Implementation / Intel Corporation.
  6. Managing Receive-Buffer Space for Inbound Completions / Xilinx // Virtex-7 FPGA Gen3 Integrated Block for PCI Express v4.3, Appendix B
  7. PCIe Completion Timeout / Altera Forum
  8. PCIe packet in cyclone VI GX / Altera Forum
  9. PCIe simple transaction / Altera Forum
  10. PCIe w/ Avalon ST: Equivalent of ko_cpl_spc_vc0? / Altera Forum
  11. Point me in the right Direction – PCIe / Altera Forum
  12. Request timeouts in PCIE / Altera Forum
  13. The High-speed Interface Design Based on PCIe of the Non-cooperative Receiver Verification Platform / Li Xiao-ning, Yao Yuan-cheng and Qin Ming-wei // 2016 International Conference on Mechanical, Control, Electric, Mechatronics, Information and Computer
  14. PCI Express Base Specification Revision 3.0 / PCI-SIG
  15. Stratix V Avalon-ST Interface for PCIe Solutions / Intel Corporation
  16. Cyclone V Avalon-ST Interface for PCIe Solutions / Intel Corporation
