Author on the Run
ApoorvCTF Forensics Writeup - Keyboard Audio Leakage#
Challenge summary#
Description:
No time to explain! The organizers are after me. I stole the flag for you by sneakily recording their keyboard.
I managed to capture their keyboard keypresses before the event: every key (qwertyuiopasdfghjklzxcvbnm) pressed 50 times. Don't ask how. Then, while they were uploading the real challenge flag to CTFd, I left a mic running and recorded every keystroke.
Now I'm on the run. If the organizers catch you with this, you never saw me. Good luck, and hurry!
We are given two WAV files:
- Reference.wav (training capture)
- flag.wav (the real typed message)
Story hint says the attacker recorded each key from qwertyuiopasdfghjklzxcvbnm 50 times, then recorded the organizer typing the flag.
Expected format: apoorvctf{decoded_text}
Objective#
Recover the text typed in flag.wav using Reference.wav as labeled training audio.
Initial triage#
I first verified basic metadata.
```bash
file "Reference.wav" "flag.wav"
```
Output:
```text
Reference.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz
flag.wav:      RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz
```
Then checked durations and audio parameters in Python:
```python
import wave

for f in ["Reference.wav", "flag.wav"]:
    w = wave.open(f, "rb")
    print(
        f,
        "channels", w.getnchannels(),
        "rate", w.getframerate(),
        "width", w.getsampwidth(),
        "frames", w.getnframes(),
        "duration", w.getnframes() / w.getframerate(),
    )
```
Observed:
- Reference.wav is long (~304.6 s), consistent with many sample keypresses.
- flag.wav is short (~12.25 s), consistent with a short typed message.
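As a quick sanity check (a back-of-envelope sketch, using the duration and keypress count reported above), the reference timing is consistent with the story:

```python
# Sanity check: do the durations match the story?
ref_duration = 304.6     # seconds, from the wave-module dump above
n_ref_presses = 26 * 50  # 1300 keypresses per the challenge prompt

avg_spacing = ref_duration / n_ref_presses
print(f"avg spacing between reference presses: {avg_spacing:.3f} s")
# ~0.234 s per press, so a minimum inter-peak gap around 0.12 s sits
# safely below the typical spacing while still merging each key's
# press/release transients into a single detection.
```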
Attack plan#
- Detect keypress onsets in both files using short-term energy.
- Build per-letter templates from the reference file.
- Classify each keypress in flag.wav by similarity to templates.
- Wrap the decoded text as apoorvctf{...}.

Important assumption (from the prompt):
- The 1300 reference keypresses come in blocks of 50 per letter, in the keyboard-order string qwertyuiopasdfghjklzxcvbnm.

So labels are:
- first 50 onsets -> q
- next 50 -> w
- …
- last 50 -> m
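The block-labeling rule above can be spot-checked in isolation (a minimal sketch of the same `i // 50` mapping the solver uses):

```python
KEYS = "qwertyuiopasdfghjklzxcvbnm"

# Label the i-th detected reference onset: integer-divide by the
# block size (50) to find which letter's block it falls in.
labels = [KEYS[i // 50] for i in range(26 * 50)]

assert labels[0] == "q"     # first block
assert labels[49] == "q"    # still the first block
assert labels[50] == "w"    # second block starts
assert labels[1299] == "m"  # last onset of the last block
```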
Solver script (full code used)#
Save as solve.py in the same directory as the WAV files:
```python
#!/usr/bin/env python3
import wave

import numpy as np

KEYS = "qwertyuiopasdfghjklzxcvbnm"


def load_wav(path: str):
    w = wave.open(path, "rb")
    x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16).astype(np.float32)
    return x, w.getframerate()


def detect_onsets(x: np.ndarray, sr: int, min_gap_s: float, thr_mul: float = 4.0):
    """Detect keypress peaks from a smoothed absolute-amplitude envelope."""
    win = max(1, int(0.003 * sr))
    env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
    th = np.median(env) + thr_mul * np.std(env)
    min_gap = int(min_gap_s * sr)
    peaks = []
    amps = []
    i = 0
    n = len(env)
    while i < n:
        if env[i] > th:
            j = min(n, i + min_gap)
            k = i + int(np.argmax(env[i:j]))
            peaks.append(k)
            amps.append(env[k])
            i = j
        else:
            i += 1
    return np.array(peaks), np.array(amps)


def segment(x: np.ndarray, idx: int, L: int, pre: int):
    """Extract a fixed-length window around an onset, zero-padded at the edges."""
    s = idx - pre
    out = np.zeros(L, dtype=np.float32)
    if s < 0:
        take = x[: max(0, s + L)]
        out[-s : -s + len(take)] = take
    else:
        take = x[s : s + L]
        out[: len(take)] = take
    return out


def feat_time(v: np.ndarray):
    v = v - np.mean(v)
    return v / (np.linalg.norm(v) + 1e-9)


def feat_fft(v: np.ndarray, bins: int = 300):
    sp = np.abs(np.fft.rfft(v * np.hanning(len(v))))
    f = np.log1p(sp)[:bins]
    return f / (np.linalg.norm(f) + 1e-9)


def main():
    xr, sr = load_wav("Reference.wav")
    xf, sf = load_wav("flag.wav")
    assert sr == sf, "Sample rates must match"

    # 1) Detect reference onsets.
    pr, ar = detect_onsets(xr, sr, min_gap_s=0.12, thr_mul=4.0)
    # Detector catches a few extras; keep the strongest 1300 (= 26 * 50).
    keep = np.argsort(ar)[-1300:]
    pr = np.sort(pr[keep])

    # 2) Build labels by 50-key blocks.
    labels = np.array([KEYS[i // 50] for i in range(1300)])

    # 3) Feature extraction setup.
    L = int(0.10 * sr)     # 100 ms window
    pre = int(0.008 * sr)  # 8 ms pre-onset
    R_time = np.array([feat_time(segment(xr, p, L, pre)) for p in pr])
    R_fft = np.array([feat_fft(segment(xr, p, L, pre), bins=min(400, L // 2 + 1)) for p in pr])

    # 4) Class centroids for each key.
    C_time = {}
    C_fft = {}
    for k in KEYS:
        ct = R_time[labels == k].mean(0)
        cf = R_fft[labels == k].mean(0)
        C_time[k] = ct / (np.linalg.norm(ct) + 1e-9)
        C_fft[k] = cf / (np.linalg.norm(cf) + 1e-9)

    # 5) Detect flag onsets.
    # min_gap_s=0.16 suppresses occasional double-triggers from the same key hit.
    pf, _ = detect_onsets(xf, sf, min_gap_s=0.16, thr_mul=4.0)

    # 6) Classify each flag keypress by cosine score.
    decoded = []
    for p in pf:
        ft = feat_time(segment(xf, p, L, pre))
        ff = feat_fft(segment(xf, p, L, pre), bins=min(400, L // 2 + 1))
        best_score = -1e9
        best_key = None
        for k in KEYS:
            score = float(ft @ C_time[k] + ff @ C_fft[k])
            if score > best_score:
                best_score = score
                best_key = k
        decoded.append(best_key)

    text = "".join(decoded)
    print("decoded_raw:", text)
    print("flag_candidate:", f"apoorvctf{{{text}}}")


if __name__ == "__main__":
    main()
```
Running it#
```bash
python3 solve.py
```
Observed decode:
```text
decoded_raw: ohyougotthisfzrdzmn
```
Interpreting the decode#
The raw acoustic decode is very close to readable English and strongly suggests:
ohyougotthisfardamn
Why this is reasonable:
- Most characters decode cleanly.
- The uncertain positions come from neighboring keys with similar acoustic signatures.
- ohyougotthisfardamn is a coherent phrase, while ohyougotthisfzrdzmn is not.
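The neighboring-key claim is easy to verify by diffing the raw decode against the corrected phrase (a small check, using the two strings above):

```python
raw = "ohyougotthisfzrdzmn"    # raw acoustic decode
guess = "ohyougotthisfardamn"  # human-corrected phrase

# Collect every position where the two strings disagree.
diffs = [(i, r, g) for i, (r, g) in enumerate(zip(raw, guess)) if r != g]
print(diffs)
# [(13, 'z', 'a'), (16, 'z', 'a')] -- both errors are the same
# confusion, and Z sits directly below A on a QWERTY keyboard.
```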
Final flag:
```text
apoorvctf{ohyougotthisfardamn}
```
Notes on robustness#
- I tested multiple feature-window sizes and pre-onset offsets; the prefix ohyougotthis stayed stable.
- Using too small an inter-peak gap on flag.wav can create duplicate detections for a single keypress; increasing min_gap_s from 0.12 to 0.16 fixed that.
- The reference onset detector returned a few extras, so keeping the strongest 1300 events aligns exactly with the expected 26 * 50 samples.