Build an Audio Processing and Mastering Pipeline

The Problem

A podcast production company manages 12 shows with varying recording setups. Hosts record on everything from studio condenser microphones to laptop built-in mics on conference calls. Every episode arrives with different volume levels, background noise, sample rates, and formats (WAV, FLAC, MP3, M4A). The audio engineer manually processes each file in Audacity: normalizing loudness, removing silence, applying noise reduction, adding compression, and exporting at the correct specs for Apple Podcasts (-16 LUFS, 44.1 kHz, 128 kbps CBR MP3). Processing a single 60-minute episode takes 45 minutes of hands-on work, and with 12 shows per week, audio mastering is a full-time job. The audio engineer is manually applying the same chain of effects to every file, with the only variation being the specific EQ and compression settings for each show.

The Solution

Use SoX (Sound eXchange) for command-line audio processing -- normalization, noise reduction, silence trimming, format conversion, and effects chains -- and audiowaveform to generate visual waveform representations for quality validation, episode thumbnails, and web player displays.

Step-by-Step Walkthrough

1. Normalize and standardize incoming audio

Create a SoX processing chain that takes any input format and produces a consistent baseline: correct sample rate, bit depth, channel configuration, and loudness level. This normalization step ensures every episode enters the mastering pipeline in the same state regardless of how it was recorded.

Write a SoX command chain that processes raw podcast recordings into a standardized format. Accept any input format (WAV, FLAC, MP3, M4A). Convert to 44.1 kHz, 16-bit, mono WAV. Apply a high-pass filter at 80 Hz to remove low-frequency rumble. Normalize peak volume to -1 dB. Apply dynamic range compression with a 4:1 ratio above -20 dB threshold, 10ms attack, and 100ms release. Output the processed file alongside a stats summary showing RMS level, peak level, and dynamic range.

The stats summary is the engineer's first quality check: if the RMS level after compression is below -22 dB, the source recording was too quiet and may need additional gain or a different compression ratio.

2. Remove silence and trim dead air

Automatically detect and trim extended silence from episode beginnings, endings, and any dead air gaps longer than 3 seconds within the recording. This step alone typically shaves 5-15% off an episode's runtime without affecting the listening experience.

Process the normalized audio to handle silence. Trim leading silence until the first audio above -40 dB. Trim trailing silence after the last audio above -40 dB. For internal gaps, compress any silence longer than 2 seconds down to 0.8 seconds -- do not remove it entirely, as brief pauses sound natural, but 10-second gaps where someone was looking up a URL should be shortened. Use SoX's silence effect to detect gaps and the pad effect to standardize pause duration. Output a log of how many seconds of silence were removed.

The -40 dB threshold matters: too aggressive (like -30 dB) and you start clipping quiet breaths that sound natural, too lenient (like -50 dB) and you miss the dead air entirely.

3. Apply show-specific mastering profiles

Different shows have different sonic profiles. Create named presets that apply the right EQ, compression, and loudness target for each show. The profile approach means the batch processor just needs to know the show name -- all the audio engineering decisions are encoded in the preset.

Create SoX processing profiles for three of our shows. The interview show: gentle EQ boost at 3 kHz for voice clarity, light compression (3:1), target -16 LUFS for Apple Podcasts. The narrative storytelling show: warmer EQ with a slight bass boost at 200 Hz, heavier compression (5:1) for consistent volume during dramatic sections, target -16 LUFS. The news recap show: aggressive high-pass at 100 Hz, de-ess by reducing 5-8 kHz by 3 dB, fast compression (2:1, 5ms attack) for punchy delivery, target -14 LUFS for Spotify optimization. Each profile should be a reusable SoX command that takes input and output paths.

The different LUFS targets reflect platform-specific requirements: Apple Podcasts normalizes to -16 LUFS, while Spotify normalizes to -14 LUFS. Mastering to the target prevents the platform from applying its own normalization, which can degrade audio quality.

4. Generate waveform visualizations for quality check and thumbnails

Produce waveform images for visual quality validation and audiogram assets for social media promotion. A visual waveform reveals problems in seconds that would take minutes of listening to discover.

After mastering, generate audiowaveform visualizations for each episode. Create a full-episode overview waveform at 1920x200 pixels in PNG format with a dark background (#1a1a2e) and green waveform (#22c55e) for our web player. Generate a zoomed waveform of the first 60 seconds at 1920x400 pixels for visual QA -- the engineer should be able to spot clipping, dead air, or abnormal dynamics at a glance. Also create a square 1080x1080 waveform image with the show logo overlaid for the episode's Instagram post. Output the waveform data as JSON for our web-based audio player that renders interactive waveforms.

5. Batch process a week's episode queue

Wrap everything into a batch pipeline that processes all 12 weekly episodes overnight and produces mastered audio plus all visual assets.

Create a batch processing script that reads a manifest file listing this week's 12 episodes with their audio paths, show names, and episode metadata. For each episode, apply the correct show mastering profile, trim silence, generate all waveform assets, export the final mastered MP3 at 128 kbps CBR with proper ID3 tags (show name, episode title, episode number, artwork), and validate that the output meets Apple Podcasts loudness specs. Output a QA report per episode showing peak level, integrated loudness, silence removed, and final file size.

Real-World Example

The production company deployed the pipeline on a Tuesday and ran it against the week's first batch of 6 episodes overnight. The SoX processing chain handled everything from a pristine studio recording to a remote interview where one host was on a laptop mic in a hotel room. The hotel recording came in at -28 LUFS with noticeable HVAC hum -- the high-pass filter at 80 Hz cleaned the rumble, compression brought the level up to -16 LUFS, and the show-specific EQ profile added clarity to the host's voice.

Silence trimming removed a combined 14 minutes of dead air across the 6 episodes. One episode alone had a 4-minute gap where the guest had stepped away to take a phone call during recording.

The audiowaveform QA images revealed one episode had a 2-second clipping spike at the 34-minute mark from the host bumping the microphone -- something the batch stats flagged as a peak above -0.5 dB but that the visual waveform made immediately obvious as a sharp red spike in the otherwise smooth green waveform. The engineer fixed that one spot manually in 30 seconds instead of listening through the entire episode. Weekly audio processing time dropped from 45 hours to 3 hours of QA review, and the consistent mastering quality across all 12 shows made the network's podcast feed sound professional regardless of each host's recording setup.

LIVETry this use case on your own files

Skills stack · 2 skills

sox

audiowaveform