AES is the most widely used block cipher, and there are some cache side-channel attacks against it. The attacks always exploit key-dependent memory accesses because parts of AES are optimized with lookup tables. This post quickly goes over the lookup table optimizations called S-boxes and T-tables, and how such implementations can be attacked. My hope is that this can help quickly understand the attack without having to read through many papers.
1. AES in one minute
In the rest of this post, we will talk about encrypting a single block with AES-128. AES-128 encrypts a 16-byte plaintext $P$ under a 16-byte key, producing a ciphertext $C$. These 16 bytes are the state of the cipher and are often shown as a $4 \times 4$ matrix of bytes. Encrypting something with AES consists of 10 repeated rounds of operations on the start state $s^{(0)} = P \oplus K^{(0)}$ that do,
$$s^{(r+1)} \;=\; \mathrm{AddRoundKey}_{K^{(r+1)}} \circ \mathrm{MixColumns} \circ \mathrm{ShiftRows} \circ \mathrm{SubBytes}\,(s^{(r)})$$The final state $s^{(10)} = C$. The round keys $K^{(0)}, K^{(1)}, \ldots, K^{(10)}$ are derived from the master key by the key schedule. Given any round key it is easy to compute any other round key, the key schedule is fully invertible. Recovering any single round key therefore recovers the master key, so in the attack, we use $K_i$ for byte $i$ of the round key we target.
2. The S-box
$\mathrm{SubBytes}$ is usually a lookup table: a 256-byte S-box $S$, where $S[x]$ replaces every state byte $x$. The exact values of the table are irrelevant for us we only care how the table is indexed. The first round of AES computes $S[P_i \oplus K_i]$ for each byte of key and plaintext so the lookup index is the secret $P_i \oplus K_i$. If an attacker knows the plaintext byte $P_i$ and can see which index was looked up they also know the key byte $K_i$:
$$K_i \;=\; \mathrm{index}_i \oplus P_i$$In practice there are a few complications because the attacker does not observe the index directly. Instead they use a cache side-channel which allows them to see which cache lines of the table were accessed. A cache line is 64 bytes, so the 256-byte S-box spans 4 lines, and each line contains 64 S-box entries.
3. The T-tables
T-tables fold $\mathrm{SubBytes}$ and $\mathrm{MixColumns}$ into one precomputed lookup. A middle round uses four of them, with a fifth table $\mathrm{Te}_4$ for the last round. The construction is standard for a small explanation see here well covered elsewhere.
The only notable change for us is that T-table entries are bigger than S-box entries because a T-table also folds in $\mathrm{MixColumns}$, which turns each input byte into a contribution to all four bytes of an output column. Each of the 256 T-table entries is therefore a 32-bit word, making one table $256 \times 4 = 1$ KB, four times the size of an S-box.
4. Recovering the key
Now we are going to see how the attack can still recover the AES key even given only which cache lines of the table were accessed. The attacker encrypts chosen plaintexts, reads each ciphertext $C$, and can see which cache lines are touched by the encryption. The attack works by guessing a key byte, computing the internal state that the guess would create and comparing that to the lines that were touched. If the key byte matches the guess, the state will also match. Non-matching guesses create different cache states under some plaintexts.
There are two places in AES where we can compute the internal state from the information we have. In the first round the index is $P_i \oplus K_i$, and we know the plaintext $P$, so a guess of $K_i$ gives the index. In the last round the index is $\mathrm{InvS}[C_i \oplus K_i]$, and we know the ciphertext $C$ (the inverse S-box is public), so a guess of $K_i$ gives the index.
While the first round seems like a simpler target, attacks on it are not really practical. The problem is that the index $P_i \oplus K_i$ only gives us the high bits of $K_i$ independent of the plaintext we pick. For a T-table that is 4 bits times 16 bytes, half the key, leaving a $2^{64}$ brute force for the rest. Brute forcing the remaining key bits is not really an option so we instead attack the last round.
In the last round we can recover the entire key. Round 10 has no $\mathrm{MixColumns}$, just $\mathrm{SubBytes}$, $\mathrm{ShiftRows}$, $\mathrm{AddRoundKey}$. So a ciphertext byte is one S-box output XORed with a key byte, where $\mathrm{state}_i$ is the byte going into the last $\mathrm{SubBytes}$ and also the leaked index:
$$C_i = S[\,\mathrm{state}_i\,] \oplus K_i$$We know $C_i$ and want to know $\mathrm{state}_i$ because it is the lookup index into the S-box. By XORing $K_i$ onto both sides, it cancels on the right and we get:
$$C_i \oplus K_i = S[\,\mathrm{state}_i\,]$$Then undo the S-box with its inverse $\mathrm{InvS}$:
$$\mathrm{state}_i = \mathrm{InvS}[\,C_i \oplus K_i\,]$$To test a guess $g$, compute $\mathrm{InvS}[C_i \oplus g]$ and check whether its line was touched.
This time the leak gives the whole byte, not just the top bits. The reason is that we have $S[\,\mathrm{state}_i\,] \oplus K_i$ instead of $S[P_i \oplus K_i]$ and the S-box is nonlinear. So even a low bit of $K_i$ changes which line is hit. For example: $\mathrm{InvS}[\mathtt{0x00}] = \mathtt{0x52}$ but $\mathrm{InvS}[\mathtt{0x01}] = \mathtt{0x09}$, a one-bit change that jumps from line 5 to line 0. So every bit of $K_i$ leaks given we try enough ciphertexts.
5. Cache attack difference between T-tables and S-boxes
From what we know about T-tables and S-box now it seems like the attack works quite similarly for both. While this is true in principle there are some practical differences. Given the T-table one can monitor cache after one round of AES encryption and still mount the attack.
This is because a T-table spans 16 cache lines, and the 16 lookups done after one round of AES encryption touch only about 10 of them on average. A single lookup misses a given line with probability $15/16$, so all 16 miss it with probability $(15/16)^{16} \approx 0.36$. The expected number of lines hit is therefore $16\,(1 - (15/16)^{16}) \approx 10$. An attacker can rule out the 6 lines that were not hit and rule out all of the key guesses that would have hit those lines.
The S-box spans 4 lines, which 16 lookups saturate, so observing after one round gives the attacker no usable information. When attacking S-boxes the attacker needs to measure each lookup separately which requires more precise timing and makes the attack more complicated.