Now that we have freed all these registers, the design fits easily and we are left with some spare logic. We could try to use that logic to pipeline a bit more and gain some speed.
    This is how one stage looks when the S-box is expanded. This is where the pipeline registers are inserted right now. And these are the three additional options where we could place another set of registers.
    Note that with the LFSR we saved 56 registers per stage, so it seems unreasonable to expect to find 96 or even 256 registers per stage for additional pipelining. However, some of these registers come free with the LUTs. For example, the cut right in the middle would require 256 registers, but 4 out of 6 of the registers in an S-box would come free with the LUTs, as can be seen in this simplified block diagram of an LE. This effectively reduces the number of additional registers required to only 128 per stage.
    Still, none of the additional options could be implemented because of insufficient resources on the chip.
