## 30.5 A 1.41pJ/b 56Gb/s PAM-4 Wireline Receiver Employing Enhanced Pattern Utilization CDR and Genetic Adaptation Algorithms in 7nm CMOS

Shayan Shahramian<sup>\*1</sup>, Behzad Dehlaghi<sup>\*1</sup>, Joshua Liang<sup>1</sup>, Ryan Bespalko<sup>1</sup>, Dustin Dunwell<sup>1</sup>, James Bailey<sup>1</sup>, Bo Wang<sup>1</sup>, Alireza Sharif-Bakhtiar<sup>1</sup>, Michael O'Farrell<sup>1</sup>, Kerry Tang<sup>1</sup>, Anthony Chan Carusone<sup>2</sup>, David Cassan<sup>1</sup>, Davide Tonietto<sup>3</sup>

<sup>1</sup>Huawei Technologies, Toronto, Canada <sup>2</sup>University of Toronto, Toronto, Canada <sup>3</sup>Huawei Technologies, Ottawa, Canada \*Equally-Credited Authors (ECAs)

Analog mixed-signal (AMS) receivers for 50+Gb/s PAM-4 offer lower power than ADC-DSP receivers [1-3]. Those using DFEs [2-3] still suffer from relatively high power consumption because of the large number of latches needed in PAM-4 speculative DFEs. Better power efficiency can be achieved using only a CTLE [1]. However, analog front-ends (AFEs) are sensitive to variations in process, supply voltage, and temperature. To combat this while accommodating links with loss exceeding 20dB, an AFE with extensive programmability is combined with an efficient genetic adaptation algorithm that selects a setting to minimize BER, thus equalizing a 22dB-loss channel. The lack of a DFE, combined with a novel PAM-4 clock recovery scheme, greatly reduces the number of latches required compared to previous works, resulting in 1.41pJ/bit power consumption in 7nm CMOS technology.

Figure 30.5.1 shows the quarter-rate receiver block diagram with the termination and CTLE schematics inset. Twelve quarter-rate latches sample the top/middle/bottom eyes of the PAM-4 signal, and only 4 quarter-rate edge latches are used for clock recovery. The receiver termination employs shunt inductive peaking in series with the 50 $\Omega$ termination (15 adjustable settings) to extend the bandwidth. The CTLE is composed of three CML stages with source degeneration. The first stage uses resistive (6 settings) and capacitive (15 settings) source degeneration to allow adjustment of the low- and high-frequency gain. The first stage also includes midband shaping via a series-connected degeneration resistance and capacitance (25 adjustable settings). The second stage includes source resistance degeneration (6 settings) to allow for further adjustment of the low-frequency gain. The third stage acts as a buffer to drive the latches. All three stages employ squelch capacitances at their outputs to further fine-tune the high-frequency gain (7 settings total) as well as load resistance trim (8 settings total). In total, the AFE has over 10 million different combinations that can be chosen. While the large number of settings allows for flexibility to equalize different channels, the large number and interdependence of the different controls make it difficult to adapt to the optimal setting.
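The "over 10 million" figure can be checked by multiplying the setting counts quoted above. The short sketch below does this, assuming the seven controls are independent so that their counts multiply; the dictionary keys are illustrative names, not register names from the design.

```python
from math import prod

# Setting counts as quoted in the text. Treating them as seven independent
# knobs is an assumption of this sketch; the key names are hypothetical.
AFE_KNOBS = {
    "termination_peaking": 15,
    "stage1_degeneration_r": 6,
    "stage1_degeneration_c": 15,
    "stage1_midband_rc": 25,
    "stage2_degeneration_r": 6,
    "squelch_capacitance": 7,
    "load_resistance_trim": 8,
}

total_settings = prod(AFE_KNOBS.values())
print(f"{len(AFE_KNOBS)} knobs, {total_settings:,} combinations")
# prints "7 knobs, 11,340,000 combinations"
```

Under that assumption the product is about 11.3 million, consistent with the stated total and with the seven-dimensional search space the adaptation has to cover.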

Figure 30.5.2 explains the adaptation algorithm used to adapt the front-end settings. It is loosely based on a class of genetic optimization algorithms [4]. Genetic algorithms have been extensively studied for automated design of analog circuits [5], but here they are used for the adaptation of analog circuits over channel, process, voltage and temperature variations. The algorithm relies partly on randomness in choosing coefficients to help it avoid local minima in the performance surface that could cause sub-optimal convergence in traditional gradient descent algorithms. Figure 30.5.2 illustrates the procedure on a 2-D performance surface (i.e. an AFE with only 2 knobs) to allow for better visualization, but it is applied to the AFE's 7-D performance surface. The cost function being optimized is the vertical eye opening measured using the on-chip eye monitor. For PAM-4 operation, either the bottom or top eye opening is used since these outer eyes see more impairment from non-linearity. Initially, in step 1, 15 random children (AFE settings) are chosen as an initial population for the algorithm. The eye opening is measured at each of these settings (children) and the top 3 are chosen to become parents. These are then combined with additional new children created in steps 2 to 4 to form the population for the next iteration of the algorithm. In step 2 of the algorithm, the 3 parents "evolve" into 3 new children by taking the arithmetic mean of each pair of surviving parents. In step 3, 3 additional children are created by applying controlled mutations to the parents from step 2, choosing new random children within some set distance of the parents; these help the population descend the performance surface quickly while preventing the children from wandering too far from the optimal value. Step 4 involves adding 6 completely random mutations (random settings) to help explore other parts of the performance surface and help avoid getting trapped in local minima.
The children generated in steps 2, 3, and 4 are combined with the parents and are used in the next iteration of the algorithm in step 5. Steps 2-5 are then repeated. The number of required iterations depends on the number of settings and the performance surface. More iterations allow the algorithm to do a better job of scanning the performance surface. Over time, the randomness of the algorithm can be reduced to only track variations due to supply voltage and temperature.
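The five steps can be sketched in a few lines of Python. This is a minimal illustration and not the on-chip implementation: the knob ranges reuse the setting counts from the AFE description, while the mutation distance, the iteration count, the rounding of the pairwise mean, and the `toy_eye` stand-in for the eye-monitor measurement are all assumptions. The sketch ranks children by eye opening, so a larger value is better.

```python
import itertools
import random

# Settings per knob (7 knobs), taken from the AFE description.
KNOB_SIZES = [15, 6, 15, 25, 6, 7, 8]

def random_setting(rng):
    # A completely random child: one code per knob.
    return tuple(rng.randrange(n) for n in KNOB_SIZES)

def pair_mean(a, b):
    # Step 2: arithmetic mean of two parents, rounded down to a valid code.
    return tuple((x + y) // 2 for x, y in zip(a, b))

def mutate_near(parent, rng, distance=2):
    # Step 3: random child within a set distance of a parent (distance is assumed).
    return tuple(max(0, min(n - 1, x + rng.randint(-distance, distance)))
                 for x, n in zip(parent, KNOB_SIZES))

def adapt(measure_eye, iterations=20, seed=0):
    rng = random.Random(seed)
    population = [random_setting(rng) for _ in range(15)]        # step 1
    history = []
    for _ in range(iterations):
        ranked = sorted(population, key=measure_eye, reverse=True)
        parents = ranked[:3]                                     # top 3 survive
        history.append(measure_eye(parents[0]))
        means = [pair_mean(a, b)
                 for a, b in itertools.combinations(parents, 2)] # step 2: 3 children
        nearby = [mutate_near(p, rng) for p in parents]          # step 3: 3 children
        randoms = [random_setting(rng) for _ in range(6)]        # step 4: 6 children
        population = parents + means + nearby + randoms          # step 5: 15 total
    return max(population, key=measure_eye), history

def toy_eye(setting):
    # Hypothetical stand-in for the eye monitor: peaks at an arbitrary target.
    target = (7, 3, 9, 12, 2, 4, 5)
    return -sum((s - t) ** 2 for s, t in zip(setting, target))

best, history = adapt(toy_eye)
print(best, toy_eye(best))
```

Because the three parents are carried into the next population, the best measured opening never decreases from one iteration to the next when the measurement is noise-free. A real eye monitor is noisy, so a practical version would likely re-measure the parents on each pass.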

The clock recovery method can be understood by considering the PAM-4 transitions shown in Fig. 30.5.3, where they are decomposed into good and bad transitions. The location of the good transitions is the optimal lock point for the edge clock. To utilize all good transitions, 3 sets of edge latches are required (12 latches for a quarter-rate system), which increases the power and area of the receiver. Alternatively, only the center transitions can be used, with pattern filtering to ignore the bad transitions [1-2]; however, this lowers tracking bandwidth due to the attendant reduction in transition density. The proposed receiver uses only 1 set of edge latches (4 latches for a quarter-rate system). Instead of ignoring the bad transitions, it uses them to reduce the CDR lock time and to increase the tracking bandwidth. The phase detector (PD) logic is shown in Fig. 30.5.3. For the good transitions, the phase detector behaves like a conventional bang-bang PD. For the bad transitions, the edge sample is still used to indicate when the clock is "very" early/late. If the sampling point falls within the very early/late region, the PD applies a larger shift in the sampling point than when detecting good transitions. Using this PD logic reduces the receiver area/power (4 edge latches vs. 12 edge latches) and reduces the capacitive loading on the CTLE, which translates into additional power savings.
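The following sketch illustrates one way such a phase detector could be expressed in software. The actual logic is defined in Fig. 30.5.3, which is not reproduced in the text, so several points here are assumptions: the single edge latch is taken to slice at the middle threshold, PAM-4 levels are numbered 0 to 3, transitions symmetric about the middle threshold (0↔3, 1↔2) are treated as good, two-level transitions (0↔2, 1↔3) are treated as bad, and the "large" step of 2 is an arbitrary weight. The function name `pd_vote` is hypothetical.

```python
def msb(level):
    # Decision of a slicer at the middle threshold for PAM-4 levels 0..3.
    return 1 if level >= 2 else 0

def pd_vote(prev_level, next_level, edge_sample, big_step=2):
    """Return >0 if the clock looks late, <0 if early, 0 if no information."""
    if msb(prev_level) == msb(next_level):
        return 0  # no crossing of the middle threshold (assumed edge threshold)
    # Bang-bang rule: an edge sample that already equals the next symbol's
    # decision means the edge clock sampled after the crossing (late).
    direction = 1 if edge_sample == msb(next_level) else -1
    if prev_level + next_level == 3:
        return direction  # good transition: crosses mid threshold at mid-UI
    # Bad transition: at the ideal edge instant the waveform sits on an inner
    # level, so the edge sample is predictable. A mismatch means the clock is
    # far enough off to land on the other side of the offset crossing.
    expected = 1 if prev_level + next_level > 3 else 0
    if edge_sample == expected:
        return 0
    return direction * big_step  # "very" early/late: apply a larger shift

print(pd_vote(0, 3, 1), pd_vote(0, 2, 1), pd_vote(1, 3, 0))
```

In this reading, a bad transition contributes nothing while the clock is near lock and a strong correction when it is far off, which matches the faster lock and higher tracking bandwidth described above.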

Figure 30.5.4 shows the measured system performance for two different channels. The channels comprise cables, connectors, an ISI board, an evaluation board, and the chip package. Channel A has a loss of 17.8dB and Channel B has a loss of 22.3dB at 14GHz, as shown in Fig. 30.5.6. A Keysight (M8045A) pattern generator is used as the PAM-4 transmitter with 5dB of pre-emphasis. Figure 30.5.4a shows the measured contours captured using the on-chip eye monitor for a PRBS31 pattern and for channel B; the top and bottom eyes have 33mV of opening while the middle eye has an opening of 38mV at a BER of 1E-6. Figure 30.5.4b shows the measured bathtub curve, with an opening of 0.17UI for channel A and 0.12UI for channel B at a BER of 1E-6. There is an upper bound in the measurable BER of 1E-2 due to the counter size used in the on-chip BER checker.

Figure 30.5.5 (left) shows the measured receiver jitter tolerance at a BER of 1E-6 for channels A and B. The receiver meets the CEI-56G-VSR-PAM4 jitter tolerance mask for both channels. Figure 30.5.5 (right) shows the measured jitter histogram from a Keysight sampling oscilloscope (DCA-X 86100D). The RJ is 240fs when the CDR is frozen and 600fs when the CDR is in tracking mode.

Figure 30.5.6 shows the power breakdown of the receiver and a table comparing this work to previous receivers at similar data rates. The comparison includes clock distribution power but not clock generation power. This work consumes 50% less power and occupies 40% less area than other AMS receivers operating at similar data rates and channel losses. The receiver power is reported both with and without regulators. The energy efficiency with regulators is 1.87pJ/bit (reported from the regulator supply); for a fair comparison to previous works, however, the energy efficiency of 1.41pJ/bit (reported from the regulated supply voltage) should be used. A chip photo with area breakdown is shown in Fig. 30.5.7.
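As a quick consistency check, the two energy-efficiency figures can be converted into total power by multiplying by the data rate. The sketch below assumes the 56.25Gb/s rate quoted in the caption of Fig. 30.5.4; the resulting power values are implied by that arithmetic rather than quoted from the measured breakdown, and `power_mw` is a hypothetical helper.

```python
def power_mw(energy_pj_per_bit, data_rate_gbps):
    # pJ/bit * Gb/s = 1e-12 J/bit * 1e9 bit/s = 1e-3 W, i.e. milliwatts.
    return energy_pj_per_bit * data_rate_gbps

DATA_RATE_GBPS = 56.25  # assumed operating rate, from the Fig. 30.5.4 caption

regulated = power_mw(1.41, DATA_RATE_GBPS)        # from the regulated supply
with_regulators = power_mw(1.87, DATA_RATE_GBPS)  # from the regulator supply
overhead = with_regulators / regulated - 1        # extra power due to regulators

print(f"{regulated:.1f} mW, {with_regulators:.1f} mW, +{overhead:.0%}")
```

Under this assumption the receiver draws roughly 79mW from the regulated supply and roughly 105mW including regulators, an overhead of about a third.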

## References:

[1] E. Depaoli, et al. "A 4.9pJ/b 16-to-64Gb/s PAM-4 VSR Transceiver in 28nm FDSOI CMOS", *ISSCC*, pp. 112-114, Feb. 2018.

[2] P.-J. Peng, et al. "A 56Gb/s PAM-4/NRZ Transceiver in 40nm CMOS", *ISSCC*, pp. 110-111, Feb. 2017.

[3] J. Im, et al. "A 40-to-56Gb/s PAM-4 Receiver with 10-Tap Direct Decision-Feedback Equalization in 16nm FinFET", *ISSCC*, pp. 114-115, Feb. 2017.

[4] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning", Addison-Wesley, Reading MA USA 1989.

[5] H. Shibata, et al. "Analog circuit synthesis by superimposing of sub-circuits", *IEEE ISCAS*, pp. 427-430, 2001.







Figure 30.5.3: PAM-4 transitions and proposed phase detector logic.





Figure 30.5.2: Genetic adaptation algorithm procedure for optimization of the analog front-end parameters.




Figure 30.5.4: Measured RX eye diagram and bathtub curve at 56.25Gb/s using 17.8dB and 22.3dB loss channels with a PRBS31 Pattern.



Figure 30.5.6: Channel insertion loss, power breakdown and comparison table.

Figure 30.5.7: 7nm FinFET chip micrograph and area breakdown.

| Block                  | Dimensions (active area) |
|------------------------|--------------------------|
| AFE                    | 97 μm x 205 μm           |
| Latches & Deserializer | 50 μm x 190 μm           |
| Phase Interpolators    | 140 μm x 127 μm          |
| Reference DACs         | 90 μm x 90 μm            |
| Regulator Core         | 98 μm x 40 μm            |
| Rx Lane                | … μm                     |