1296 - Do the Lattice PCIe DevKit endpoint reference designs support Write-Combining transactions?

Write-Combining allows the CPU to burst MWr TLPs with payloads of up to 64 bytes to a PCIe endpoint, but it has implications for the endpoint design.

PC CPUs have a memory caching mode known as Write-Combining (WC). Write-Combining allows the memory manager of the CPU to buffer writes destined for the same address range. Once the WC cache line (64 bytes on current Pentium processors) is filled, the memory manager performs a burst write. In PCIe this becomes a single MWr TLP with a 64-byte payload. The memory manager only performs WC if the memory range is marked as WC. This is done either by a device driver marking a BAR range with the WC attribute, or by the BIOS/OS recognizing the PCI/PCIe device as a video device (by class code). Write-Combining is primarily targeted at video frame buffers, to improve performance when the CPU loads millions of pixels into the graphics hardware to change the display image.
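
The buffering behavior described above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class WriteCombiningBuffer and its methods are hypothetical names, not a CPU or driver API, and a real CPU has several WC buffers with its own flush rules. The model fills 64-byte lines, emits a full line as one MWr, and on flush emits each contiguous run of written bytes as a separate, smaller MWr.

```python
LINE_SIZE = 64  # WC line size in bytes, as stated in the text above


class WriteCombiningBuffer:
    """Toy model of CPU write-combining (hypothetical, not a real API)."""

    def __init__(self):
        self.lines = {}  # line base address -> {offset in line: byte value}
        self.tlps = []   # emitted MWr TLPs as (address, payload) tuples

    def write(self, addr, data):
        """Buffer a CPU store; emit one MWr as soon as a line is full."""
        for i, byte in enumerate(data):
            a = addr + i
            base = a - (a % LINE_SIZE)
            line = self.lines.setdefault(base, {})
            line[a - base] = byte
            if len(line) == LINE_SIZE:
                self._emit(base)

    def flush(self):
        """Force out all partially filled lines."""
        for base in list(self.lines):
            self._emit(base)

    def _emit(self, base):
        """Emit one MWr per contiguous run of bytes written in the line."""
        line = self.lines.pop(base)
        offsets = sorted(line)
        start = prev = offsets[0]
        for off in offsets[1:] + [None]:
            if off is None or off != prev + 1:
                payload = bytes(line[o] for o in range(start, prev + 1))
                self.tlps.append((base + start, payload))
                start = off
            prev = off


wc = WriteCombiningBuffer()
for i in range(16):
    # 16 DWord stores fill one 64-byte line at 0x1000.
    wc.write(0x1000 + 4 * i, bytes([i] * 4))
wc.write(0x2000, b"\x01\x02\x03\x04")  # a line that never fills
wc.flush()

for addr, payload in wc.tlps:
    print(hex(addr), len(payload))
```

In this model the sixteen DWord stores leave the CPU as one 64-byte MWr, while the lone store to 0x2000 leaves as a 4-byte MWr only when the buffer is flushed. This is why an endpoint cannot assume every WC write is 64 bytes.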

Write Combining has the following characteristics:
  • The MWr TLP payload size may be up to 64 bytes.
  • The TLPs may be sent in any order.
  • There is no analogous read operation.
  • It is software driven.

Impacts on a PCIe endpoint:
  • The PCIe Completer design must be able to handle MWr TLPs with payloads from 1 byte to 64 bytes. The Lattice PCIe Completer is not designed to handle bursts from the Root Complex. Therefore the reference designs do not support Write-Combining.
  • TLPs may be sent to the endpoint in any address order. The WC cache line flush rules may flush TLPs in non-sequential address order. This is not a problem for endpoint completers that handle each MWr as a standalone operation (i.e. not dependent on prior or future MWr TLPs). It is a problem if the endpoint assumes TLPs arrive in sequential order, for example by loading them directly into a FIFO; such a design must re-order the TLPs by address. Also, the payload size may not always be 64 bytes.
  • The CPU can burst data to the endpoint, but it cannot retrieve data in a burst. WC is not effective in a bidirectional application. It only gives performance gains when writing to the device.
  • WC is driven by software running on the CPU. Throughput will be directly related to software and CPU load. It will not be deterministic.
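
The ordering impact can be shown with a short sketch. It is an illustration only: receive_by_address and receive_into_fifo are hypothetical names that model two styles of endpoint completer in software, and are not part of any Lattice reference design. Three 64-byte MWr TLPs are delivered with the middle line arriving last.

```python
def receive_by_address(tlps, base, size):
    """Treat every MWr as standalone: store the payload at its own address."""
    mem = bytearray(size)
    for addr, payload in tlps:
        off = addr - base
        mem[off:off + len(payload)] = payload
    return bytes(mem)


def receive_into_fifo(tlps):
    """Append payloads in arrival order, ignoring the TLP address."""
    fifo = bytearray()
    for _, payload in tlps:
        fifo += payload
    return bytes(fifo)


BASE = 0x1000
expected = bytes(range(192))  # what software intended to write

# Three 64-byte MWr TLPs in address order ...
in_order = [(BASE + 64 * i, expected[64 * i:64 * (i + 1)]) for i in range(3)]
# ... but flushed by the CPU with the middle line last.
arrival = [in_order[0], in_order[2], in_order[1]]

by_address_ok = receive_by_address(arrival, BASE, 192) == expected
fifo_ok = receive_into_fifo(arrival) == expected
print("address-indexed:", by_address_ok, "fifo:", fifo_ok)
```

The address-indexed receiver reconstructs the intended data regardless of arrival order, while the FIFO receiver ends up with the second and third lines swapped.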

Conclusion
The Lattice PCIe reference designs do not use Write-Combining. All CPU accesses are considered control plane accesses (1 DWord per TLP). For throughput applications, a DMA engine integrated with the PCIe core initiates all MRd/MWr operations to PC system memory, moving data between the PCIe endpoint and system memory at hardware speeds.
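
As a rough illustration of why 1 DWord control plane accesses are not suited to bulk data movement, the sketch below compares the link efficiency of a 4-byte and a 64-byte MWr payload. The per-TLP overhead figures are assumptions and not taken from the reference designs: a 3 DWord header (32-bit addressing), 2 bytes of framing, a 2-byte sequence number, a 4-byte LCRC and no ECRC, as on a Gen1/Gen2 link. DLLP traffic and flow control are ignored, and payload_efficiency is a hypothetical helper name.

```python
def payload_efficiency(payload_bytes, header_bytes=12, framing_bytes=2,
                       seq_bytes=2, lcrc_bytes=4):
    """Fraction of bytes on the link that are payload, for one MWr TLP.

    The default overhead values are assumptions (3DW header, no ECRC,
    Gen1/Gen2 framing); adjust them for other configurations.
    """
    overhead = header_bytes + framing_bytes + seq_bytes + lcrc_bytes
    return payload_bytes / (payload_bytes + overhead)


one_dword = payload_efficiency(4)    # control plane access, 1 DWord per TLP
full_line = payload_efficiency(64)   # one full 64-byte burst

print(f"1 DWord payload: {one_dword:.1%}")
print(f"64-byte payload: {full_line:.1%}")
```

Under these assumptions a 1 DWord write uses about 17% of its link bytes for payload, against about 76% for a 64-byte payload. The figures describe per-TLP efficiency only and say nothing about achieved throughput.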