Author on the Run
ApoorvCTF Forensics Writeup - Keyboard Audio Leakage#
Challenge summary#
Description:
No time to explain! The organizers are after me. I stole the flag for you by sneakily recording their keyboard.
I managed to capture their keyboard keypresses before the event: every key (qwertyuiopasdfghjklzxcvbnm) pressed 50 times. Don't ask how. Then, while they were uploading the real challenge flag to CTFd, I left a mic running and recorded every keystroke.
Now I'm on the run. If the organizers catch you with this, you never saw me. Good luck, and hurry!
We are given two WAV files:
- Reference.wav (training capture)
- flag.wav (the real typed message)
Story hint says the attacker recorded each key from qwertyuiopasdfghjklzxcvbnm 50 times, then recorded the organizer typing the flag.
Expected format: apoorvctf{decoded_text}
Objective#
Recover the text typed in flag.wav using Reference.wav as labeled training audio.
Initial triage#
I first verified basic metadata.
```bash
file "Reference.wav" "flag.wav"
```
Output:
```text
Reference.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz
flag.wav:      RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz
```
Then checked durations and audio parameters in Python:
```python
import wave

for f in ["Reference.wav", "flag.wav"]:
    w = wave.open(f, "rb")
    print(
        f,
        "channels", w.getnchannels(),
        "rate", w.getframerate(),
        "width", w.getsampwidth(),
        "frames", w.getnframes(),
        "duration", w.getnframes() / w.getframerate(),
    )
```
Observed:
- Reference.wav is long (~304.6 s), consistent with many sample keypresses.
- flag.wav is short (~12.25 s), consistent with a short typed message.
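As a quick sanity check (a back-of-envelope sketch, using the duration and keypress count reported above), the reference timing is consistent with the story:

```python
# Sanity check: do the durations match the story?
ref_duration = 304.6     # seconds, from the wave-module dump above
n_ref_presses = 26 * 50  # 1300 keypresses per the challenge prompt

avg_spacing = ref_duration / n_ref_presses
print(f"avg spacing between reference presses: {avg_spacing:.3f} s")
# ~0.234 s per press, so a minimum inter-peak gap around 0.12 s sits
# safely below the typical spacing while still merging each key's
# press/release transients into a single detection.
```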
Attack plan#
- Detect keypress onsets in both files using short-term energy.
- Build per-letter templates from the reference file.
- Classify each keypress in flag.wav by similarity to templates.
- Wrap the decoded text as apoorvctf{...}.

Important assumption (from the prompt):
- The 1300 reference keypresses come in blocks of 50 per letter, in the keyboard-order string qwertyuiopasdfghjklzxcvbnm.

So labels are:
- first 50 onsets -> q
- next 50 -> w
- …
- last 50 -> m
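The block-labeling rule above can be spot-checked in isolation (a minimal sketch of the same `i // 50` mapping the solver uses):

```python
KEYS = "qwertyuiopasdfghjklzxcvbnm"

# Label the i-th detected reference onset: integer-divide by the
# block size (50) to find which letter's block it falls in.
labels = [KEYS[i // 50] for i in range(26 * 50)]

assert labels[0] == "q"     # first block
assert labels[49] == "q"    # still the first block
assert labels[50] == "w"    # second block starts
assert labels[1299] == "m"  # last onset of the last block
```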
Solver script (full code used)#
Save as solve.py in the same directory as the WAV files:
```python
#!/usr/bin/env python3
import wave

import numpy as np

KEYS = "qwertyuiopasdfghjklzxcvbnm"


def load_wav(path: str):
    w = wave.open(path, "rb")
    x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16).astype(np.float32)
    return x, w.getframerate()


def detect_onsets(x: np.ndarray, sr: int, min_gap_s: float, thr_mul: float = 4.0):
    """Detect keypress peaks from a smoothed absolute-amplitude envelope."""
    win = max(1, int(0.003 * sr))
    env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
    th = np.median(env) + thr_mul * np.std(env)
    min_gap = int(min_gap_s * sr)
    peaks = []
    amps = []
    i = 0
    n = len(env)
    while i < n:
        if env[i] > th:
            j = min(n, i + min_gap)
            k = i + int(np.argmax(env[i:j]))
            peaks.append(k)
            amps.append(env[k])
            i = j
        else:
            i += 1
    return np.array(peaks), np.array(amps)


def segment(x: np.ndarray, idx: int, L: int, pre: int):
    """Extract a fixed-length window around an onset, zero-padded at the edges."""
    s = idx - pre
    out = np.zeros(L, dtype=np.float32)
    if s < 0:
        take = x[: max(0, s + L)]
        out[-s : -s + len(take)] = take
    else:
        take = x[s : s + L]
        out[: len(take)] = take
    return out


def feat_time(v: np.ndarray):
    v = v - np.mean(v)
    return v / (np.linalg.norm(v) + 1e-9)


def feat_fft(v: np.ndarray, bins: int = 300):
    sp = np.abs(np.fft.rfft(v * np.hanning(len(v))))
    f = np.log1p(sp)[:bins]
    return f / (np.linalg.norm(f) + 1e-9)


def main():
    xr, sr = load_wav("Reference.wav")
    xf, sf = load_wav("flag.wav")
    assert sr == sf, "Sample rates must match"

    # 1) Detect reference onsets.
    pr, ar = detect_onsets(xr, sr, min_gap_s=0.12, thr_mul=4.0)
    # Detector catches a few extras; keep the strongest 1300 (= 26 * 50).
    keep = np.argsort(ar)[-1300:]
    pr = np.sort(pr[keep])

    # 2) Build labels by 50-key blocks.
    labels = np.array([KEYS[i // 50] for i in range(1300)])

    # 3) Feature extraction setup.
    L = int(0.10 * sr)     # 100 ms window
    pre = int(0.008 * sr)  # 8 ms pre-onset
    R_time = np.array([feat_time(segment(xr, p, L, pre)) for p in pr])
    R_fft = np.array([feat_fft(segment(xr, p, L, pre), bins=min(400, L // 2 + 1)) for p in pr])

    # 4) Class centroids for each key.
    C_time = {}
    C_fft = {}
    for k in KEYS:
        ct = R_time[labels == k].mean(0)
        cf = R_fft[labels == k].mean(0)
        C_time[k] = ct / (np.linalg.norm(ct) + 1e-9)
        C_fft[k] = cf / (np.linalg.norm(cf) + 1e-9)

    # 5) Detect flag onsets.
    # min_gap_s=0.16 suppresses occasional double-triggers from the same key hit.
    pf, _ = detect_onsets(xf, sf, min_gap_s=0.16, thr_mul=4.0)

    # 6) Classify each flag keypress by cosine score.
    decoded = []
    for p in pf:
        ft = feat_time(segment(xf, p, L, pre))
        ff = feat_fft(segment(xf, p, L, pre), bins=min(400, L // 2 + 1))
        best_score = -1e9
        best_key = None
        for k in KEYS:
            score = float(ft @ C_time[k] + ff @ C_fft[k])
            if score > best_score:
                best_score = score
                best_key = k
        decoded.append(best_key)

    text = "".join(decoded)
    print("decoded_raw:", text)
    print("flag_candidate:", f"apoorvctf{{{text}}}")


if __name__ == "__main__":
    main()
```
Running it#
```bash
python3 solve.py
```
Observed decode:
```text
decoded_raw: ohyougotthisfzrdzmn
```
Interpreting the decode#
The raw acoustic decode is very close to readable English and strongly suggests:
ohyougotthisfardamn
Why this is reasonable:
- Most characters decode cleanly.
- The uncertain positions come from neighboring keys with similar acoustic signatures.
- ohyougotthisfardamn is a coherent phrase, while ohyougotthisfzrdzmn is not.
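The neighboring-key claim is easy to verify by diffing the raw decode against the corrected phrase (a small check, using the two strings above):

```python
raw = "ohyougotthisfzrdzmn"    # raw acoustic decode
guess = "ohyougotthisfardamn"  # human-corrected phrase

# Collect every position where the two strings disagree.
diffs = [(i, r, g) for i, (r, g) in enumerate(zip(raw, guess)) if r != g]
print(diffs)
# [(13, 'z', 'a'), (16, 'z', 'a')] -- both errors are the same
# confusion, and Z sits directly below A on a QWERTY keyboard.
```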
Final flag:
```text
apoorvctf{ohyougotthisfardamn}
```
Notes on robustness#
- I tested multiple feature-window sizes and pre-onset offsets; the prefix ohyougotthis stayed stable.
- Using too small an inter-peak gap on flag.wav can create duplicate detections for a single keypress; increasing min_gap_s from 0.12 to 0.16 fixed that.
- The reference onset detector returned a few extras, so keeping the strongest 1300 events aligns exactly with the expected 26 * 50 samples.