audiolla

Thirty audio engines. One port. Zero cloud. Fire-and-forget async jobs. Webhooks.

You needed Demucs for stems. Then librosa for BPM and key. Then basic-pitch for MIDI transcription. Then pyannote for speaker diarization. Then DeepFilterNet for speech enhancement. Then you spent three days debugging Python version conflicts and now you hate everything.

audiolla is what happens when you stop doing that.

Every audio processing tool worth using — wrapped in one HTTP API, running in one Docker container. POST a file. Get audio, JSON, or MIDI back. Drive it from curl, shell scripts, Python notebooks, Makefiles, or point an LLM agent at the MCP endpoint and let it rip.

No account. No subscription. No per-minute billing. No vendor lock-in. docker run and you're done.

What's in the box


🎛️ Stem separation	Demucs — htdemucs, fine-tuned, 6-stem, MDX variants
🎚️ Mastering	Reference mastering (matchering) + custom pedalboard chains
📊 Analysis	BPM · key · LUFS · beats · onsets · melody · structural segments
🎹 Chords + key	Chord detection + Krumhansl-Schmuckler key estimation
🎵 Audio → MIDI	Polyphonic transcription via Spotify's basic-pitch (ONNX, no TF)
🧹 Restoration	De-reverb · de-echo · de-noise via UVR BS-Roformer + MelBand Roformer
🗣️ Speech	Enhancement (DeepFilterNet) · VAD (silero-vad) · diarization (pyannote)
🖼️ Visuals	Spectrogram + waveform PNGs + 8-mode animated MP4/WebM
🔍 Fingerprint	Chromaprint acoustic fingerprinting (AcoustID-compatible)
✂️ Silence	Detect gaps · trim edges · strip all silence
🎼 MIDI pipeline	Compose from JSON · inspect · transform · render via fluidsynth
🎸 Effects	23-effect pedalboard chain — Compressor, Reverb, PitchShift, filters…
🔧 Transforms	Sox DSP — pitch, tempo, EQ, reverb, gain
📢 Loudness	Measure LUFS · normalize to target
🥁 HPSS	Harmonic/percussive source separation via librosa median filter
🔇 Noise reduction	Spectral noise reduction via noisereduce — stationary + adaptive modes
⏩ Time-stretch	Independent tempo factor + pitch shift via librosa phase vocoder
🏷️ Audio tagging	Top-K AudioSet class labels via Audio Spectrogram Transformer
🔗 Audio embeddings	512-dim semantic embeddings via LAION CLAP + optional text similarity
🏷️ Zero-shot classify	CLAP cosine similarity against any free-form text labels — genres, moods, instruments
📋 Audio info	ffprobe metadata — duration, sample rate, channels, codec, bit depth
✂️ Trim	Cut a clip by start/end seconds — any format in, any format out
🎚️ Mix	Combine N staged tracks with per-track gain_db — pure ffmpeg, no model
🔗 Concat	Stitch N audio files end-to-end in order
⏩ Speed	Change playback speed without pitch shift (0.1× – 10×) via ffmpeg atempo
🔄 Convert	Re-encode: format, sample rate, channel count in one call
🔍 Similar	Cosine similarity between two audio files via CLAP embeddings
🎹 MIDI quantize	Snap MIDI note timings to a rhythmic grid (16th, 8th, quarter…)
🌅 Fade	Fade-in and/or fade-out with 13 curve shapes
⏪ Reverse	Flip audio backwards
🔁 Loop	Repeat audio N times
🎯 BPM match	Auto-detect BPM then stretch to a target — no manual math
📈 Loudness curve	RMS envelope over time — time-stamped dB values for gain automation
🎤 Pitch correct	Auto-tune toward nearest chromatic semitone — configurable strength
🔧 Repair	Declip + dehum — fix clipped peaks and remove power-line hum
🔁 Loop point	Find best seamless loop boundary — score, bar count, candidates list
🥁 Drum machine	Step-sequencer spec → GM drum MIDI — 16-step pattern, swing, tempo
🎼 Chords to MIDI	Chord progression → MIDI file — root+3rd+5th voicings per segment
↔️ Stereo width	Widen or collapse the stereo image via M/S processing
✂️ Split	Split into N equal parts or on silence — returns ZIP of segments
🔊 Pan	Position audio in the stereo field (-1 left → 0 center → 1 right)
🎚️ EQ	Parametric EQ — JSON array of freq/gain_db/width_hz bands
🎵 Key match	Detect source key then pitch-shift to a target key
🎙️ Sidechain duck	Duck music when a trigger track (voice) is loud
🏷️ Metadata	Read and write ID3/Vorbis/FLAC/WAV audio tags via mutagen
🔴 Clip detect	Detect digital clipping — count, ratio, peak dBFS
↔️ Mid/Side	Encode L/R → Mid+Side or decode Mid+Side → L/R
✂️ Beat slice	Slice audio at detected beat positions — returns ZIP of segments
🏟️ Conv reverb	Convolution reverb via impulse response — wet_mix control
🥁 Transient shaper	Attack/sustain dual-compressor — punch up drums, cut room tail
🎚️ Multiband compress	N-band compressor with zero-phase LR4 crossovers — mastering-grade dynamics
🎛️ DJ prep	One call: BPM + key + Camelot wheel position + integrated LUFS
📦 Batch	Run trim/convert/fade/reverse/speed/eq on staged files in sequence
🧩 Presets + pipeline	Curated YAML workflows (`master-for-spotify`, `podcast-cleanup`, …) + ad-hoc op chaining server-side
🗂️ Catalog	`GET /v1/catalog` — machine-readable endpoint list grouped by category for discovery
⚡ Async jobs	Every endpoint supports `async_job=true` — fire-and-forget + webhook callbacks

Run it

# no GPU
docker run --rm -it \
  -v $HOME/.audiolla-data:/data \
  -p 8000:8000 \
  psyb0t/audiolla:latest

# GPU
docker run --rm -it --gpus all \
  -v $HOME/.audiolla-data:/data \
  -e AUDIOLLA_DEVICE=cuda \
  -p 8000:8000 \
  psyb0t/audiolla:latest-cuda

Demucs weights prefetch at container startup (for whichever variants are enabled) and cache in /data/torch_cache/. First boot downloads them; same -v mount next time and they're already there. Other engines (matchering, pedalboard, librosa, sox, fx, midi) have no weights — they're ready as soon as /healthz is green.

Migration from v0.23.x → v1.0.0

v1.0.0 is a breaking API release. Every existing client breaks. The new shape:

Every audio endpoint takes a JSON body. Form uploads (multipart/form-data) are gone; the only non-JSON request body left is the raw-bytes PUT at /v1/files.
Input is file_path (FILES_DIR-relative) xor file_url (server-side fetch). Pre-stage the file via PUT /v1/files/{path} first.
Output requires output_path xor output_url. No more raw audio bytes in responses.
Async path: async_job=true auto-stages to jobs/{id}.{ext} if neither output is given.
MCP audio-producing tools dropped audio_base64 (and midi_base64 / image_base64 / video_base64). Same output_path xor output_url requirement.
openapi.yaml is now the contract — Pydantic models regenerate from it via make generate. Never hand-edit src/audiolla/schema/_generated.py.

- curl -X POST http://localhost:8000/v1/audio/normalize \
-     -F "file=@track.wav" -F "target_lufs=-14" -o normalized.wav

+ # 1) stage the file (raw bytes via PUT; the only non-JSON request body left)
+ curl -X PUT --data-binary @track.wav \
+     -H 'Content-Type: application/octet-stream' \
+     http://localhost:8000/v1/files/uploads/track.wav

+ # 2) process via JSON body — response is JSON, not bytes
+ curl -X POST http://localhost:8000/v1/audio/normalize \
+     -H 'Content-Type: application/json' \
+     -d '{"file_path":"uploads/track.wav","target_lufs":-14,"output_path":"out/normalized.wav"}'

+ # 3) retrieve the result
+ curl -o normalized.wav http://localhost:8000/v1/files/out/normalized.wav

Why? See the v1.0.0 CHANGELOG entry for the full rationale.
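
The same three-step flow can be driven from Python. This is a minimal stdlib-only sketch, not an official client: BASE_URL, the helper names, and the fixed staging paths are assumptions for illustration. Only the endpoints, the headers, and the `path` field in the response come from the examples above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: port from the docker run examples


def stage_request(payload: bytes, staged_path: str) -> urllib.request.Request:
    """Build the PUT that uploads raw bytes to /v1/files/{path}."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/files/{staged_path}",
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def process_request(endpoint: str, body: dict) -> urllib.request.Request:
    """Build the JSON POST for an audio endpoint such as /v1/audio/normalize."""
    return urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def normalize(local_file: str, target_lufs: float, download_to: str) -> dict:
    """Stage, process, fetch. Needs a running server; not called on import."""
    with open(local_file, "rb") as fh:
        urllib.request.urlopen(stage_request(fh.read(), "uploads/track.wav")).close()
    body = {
        "file_path": "uploads/track.wav",
        "target_lufs": target_lufs,
        "output_path": "out/normalized.wav",
    }
    with urllib.request.urlopen(process_request("/v1/audio/normalize", body)) as resp:
        result = json.load(resp)  # JSON metadata, not audio bytes
    urllib.request.urlretrieve(f"{BASE_URL}/v1/files/{result['path']}", download_to)
    return result
```

Calling `normalize("track.wav", -14, "normalized.wav")` would issue the same three requests as the curl commands in the diff.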

Quick start

Once the container is up, the six processing calls below form a complete audio pipeline. Every audio endpoint takes a JSON body now, so stage your input file at /v1/files/... first:

# stage your input file
curl -X PUT --data-binary @song.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/song.wav

# rip the vocals out of a track
curl -X POST http://localhost:8000/v1/audio/separate \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/song.wav","engine":"htdemucs","stems":["vocals"],"output_path":"out/vocals.wav"}'
# → {"path":"out/vocals.wav","size":...,"output_format":"wav"}
curl -o vocals.wav http://localhost:8000/v1/files/out/vocals.wav

# what key is it in? what are the chords?
curl -X POST http://localhost:8000/v1/audio/chords \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/song.wav"}'
# → {"key":"F# minor","key_confidence":0.91,"chords":[{"chord":"F#m","start_sec":0.0,...},...]}

# transcribe that vocal melody to MIDI
curl -X PUT --data-binary @vocals.wav -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/vocals.wav  # uploads the vocals.wav downloaded above; only if not already staged
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocals.wav","output_path":"out/melody.mid"}'

# render the MIDI back to audio through a SoundFont
curl -X POST http://localhost:8000/v1/midi/render \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"out/melody.mid","output_path":"out/rendered.wav"}'

# strip background noise from a voice recording (stage interview.wav at uploads/interview.wav first)
curl -X POST http://localhost:8000/v1/audio/noise-reduce/uvr-denoise \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/interview.wav","output_path":"out/clean.wav"}'

# who's speaking and when?
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/interview.wav"}'
# → {"num_speakers":2,"segments":[{"speaker":"SPEAKER_00","start_sec":0.5,"end_sec":8.2},...]}

Audio in. MIDI out. Chords detected. Speakers identified. De-noised. Re-synthesized. No Python environment to set up. No API keys. No account. Just HTTP.

What it can do

Output defaults to wav. Add "output_format":"mp3" to the JSON body to get mp3 instead (flac, opus, aac, pcm also work).

Every audio endpoint takes an application/json body. The only non-JSON request body is PUT /v1/files/{path}, which takes raw bytes (application/octet-stream) to stage an input file.

Input — every audio endpoint requires exactly one of:

file_path — path inside the /v1/files staging area (stage with PUT /v1/files/{path} first)
file_url — remote URL the server fetches (disabled by default — see Remote URLs)

Output — audio-producing endpoints require exactly one of:

output_path — server writes to /v1/files/<path>, returns JSON {"path":..., "size":..., ...}
output_url — server PUTs to a presigned URL, returns JSON {"url":..., "size":..., ...}

Analysis-only endpoints (those that return JSON data, e.g. /v1/audio/analyze, /v1/audio/loudness, /v1/audio/info) don't need output_path / output_url — the response is the result.
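
The input/output rules above are easy to get wrong when building bodies by hand, so a client-side pre-check can help. The sketch below is a hypothetical helper, not part of audiolla. It assumes a file-input endpoint, and it treats `async_job=true` with no output as valid because the migration notes say the server auto-stages in that case. How the server itself reports violations is not covered here.

```python
def check_body(body: dict, produces_file: bool) -> list[str]:
    """Return a list of rule violations for a request body (empty = looks fine)."""
    problems = []

    # Exactly one of file_path / file_url.
    inputs = [k for k in ("file_path", "file_url") if body.get(k)]
    if len(inputs) != 1:
        problems.append("need exactly one of file_path / file_url")

    # File-producing endpoints need exactly one of output_path / output_url,
    # unless this is an async job with no output given (auto-staged).
    outputs = [k for k in ("output_path", "output_url") if body.get(k)]
    auto_staged = bool(body.get("async_job")) and not outputs
    if produces_file and len(outputs) != 1 and not auto_staged:
        problems.append("need exactly one of output_path / output_url")

    return problems


print(check_body({"file_path": "uploads/track.wav"}, produces_file=True))
```

For analysis-only endpoints, pass `produces_file=False` and only the input rule is checked.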

Split stems

# stage input
curl -X PUT --data-binary @track.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/track.wav

# vocals only
curl -X POST http://localhost:8000/v1/audio/separate \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","engine":"htdemucs","stems":["vocals"],"output_path":"out/vocals.wav"}'
curl -o vocals.wav http://localhost:8000/v1/files/out/vocals.wav

# all 4 stems as a ZIP
curl -X POST http://localhost:8000/v1/audio/separate \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","engine":"htdemucs","output_path":"out/stems.zip"}'
curl -o stems.zip http://localhost:8000/v1/files/out/stems.zip

Master

# stage track + reference
curl -X PUT --data-binary @track.wav -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/track.wav
curl -X PUT --data-binary @ref.wav -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/ref.wav

# match EQ + loudness to a reference track
curl -X POST http://localhost:8000/v1/audio/master \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","mode":"reference","reference_path":"uploads/ref.wav","output_path":"out/mastered.wav"}'
curl -o mastered.wav http://localhost:8000/v1/files/out/mastered.wav

# run a built-in pedalboard chain (presets: transparent, loud)
curl -X POST http://localhost:8000/v1/audio/master \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","mode":"chain","preset":"loud","output_path":"out/mastered.wav"}'
curl -o mastered.wav http://localhost:8000/v1/files/out/mastered.wav

Analyze

# returns JSON. features: bpm, key, loudness, duration,
# spectral_centroid, rms, zcr. Omit features to get them all.
curl -X POST http://localhost:8000/v1/audio/analyze \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","features":["bpm","key","loudness"]}'

Beats, onsets, melody, segments

# beat grid — returns bpm + beat timestamps
curl -X POST http://localhost:8000/v1/audio/beats \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'

# onset timestamps — note attacks, transients
curl -X POST http://localhost:8000/v1/audio/onsets \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'

# dominant melody contour — pitch in Hz per frame
curl -X POST http://localhost:8000/v1/audio/melody \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'

# structural segmentation — labels recurring sections A, B, C...
curl -X POST http://localhost:8000/v1/audio/segments \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","num_segments":4}'

Beat detection also generates a click-track file when click_track=true (set output_path to receive it) — handy for aligning a mix to a grid. Pass start_bpm=140 to seed the tracker when you already know the rough tempo (faster, more accurate). Melody can be exported as a single-track MIDI file via as_midi=true + output_path.
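
A quick way to sanity-check a returned beat grid is to recompute the tempo from the timestamps: 60 divided by the median gap between consecutive beats. This is a sketch with a hypothetical helper name. It takes a plain list of seconds because the exact response field name for the timestamps is not shown here.

```python
from statistics import median


def bpm_from_beats(beat_times_sec: list[float]) -> float:
    """Tempo implied by beat timestamps: 60 / median inter-beat interval."""
    intervals = [b - a for a, b in zip(beat_times_sec, beat_times_sec[1:])]
    if not intervals:
        raise ValueError("need at least two beat timestamps")
    return 60.0 / median(intervals)


print(bpm_from_beats([0.0, 0.5, 1.0, 1.5]))  # beats half a second apart
```

The median is used instead of the mean so a single missed or doubled beat does not skew the estimate.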

Silence detection and trimming

# find silent gaps in a recording (no trim_mode → JSON only)
curl -X POST http://localhost:8000/v1/audio/silence \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","threshold_db":-30,"min_duration_sec":1.0}'

# trim all silence and stage the result
curl -X POST http://localhost:8000/v1/audio/silence \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","threshold_db":-30,"min_duration_sec":0.5,"trim_mode":"all","output_path":"out/trimmed.wav"}'
curl -o trimmed.wav http://localhost:8000/v1/files/out/trimmed.wav

# trim only leading/trailing silence
curl -X POST http://localhost:8000/v1/audio/silence \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","threshold_db":-40,"min_duration_sec":0.3,"trim_mode":"edges","output_path":"processed/trimmed.wav"}'

trim_mode=edges — chop leading + trailing silence only. trim_mode=all — remove every detected gap (compress a talk recording, tighten a loop). Without trim_mode, the response is JSON only: silent_ranges, non_silent_ranges, duration — and output_path / output_url is not required.

Visualize (spectrogram, waveform, video)

Visual output splits into two sub-namespaces by output type:

# Static PNG spectrogram (color + scale params)
curl -X POST http://localhost:8000/v1/audio/visualize/image/spectrogram \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","width":1280,"height":720,"output_path":"out/spec.png"}'
curl -o spec.png http://localhost:8000/v1/files/out/spec.png

# Static PNG waveform (color param)
curl -X POST http://localhost:8000/v1/audio/visualize/image/waveform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","width":1280,"height":240,"output_path":"out/wave.png"}'
curl -o wave.png http://localhost:8000/v1/files/out/wave.png

# Animated MP4 spectrum analyser (fps + container params)
curl -X POST http://localhost:8000/v1/audio/visualize/video/spectrum \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","width":1280,"height":720,"fps":30,"container":"mp4","output_path":"out/viz.mp4"}'
curl -o viz.mp4 http://localhost:8000/v1/files/out/viz.mp4

/image/spectrogram: produces a PNG (staged via output_path or PUT to output_url). Params: width, height, color (default intensity), scale (log/lin).

/image/waveform: produces a PNG. Params: width, height, color (default lime).

/video/{mode}: spectrum (scrolling FFT), waves (oscilloscope), cqt (constant-Q transform), freqs (bar-graph analyzer), volume (VU meter), vectorscope (stereo X/Y scope), phasemeter, histogram. Params: width, height, fps, container (mp4 default, webm).

Acoustic fingerprint

# Chromaprint fingerprint — identifies a recording regardless of encoding
curl -X POST http://localhost:8000/v1/audio/fingerprint \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'
# → {"duration": 215.34, "fingerprint": "AQADtEqRRIuQ..."}

# include the raw integer array (for custom similarity scoring)
curl -X POST http://localhost:8000/v1/audio/fingerprint \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","return_raw":true}'

The base64 fingerprint string is compatible with the AcoustID lookup service.
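
For the custom similarity scoring mentioned above, one common approach with raw Chromaprint data is to compare the 32-bit integers bitwise and count matching bits. The sketch below assumes the raw array is a list of 32-bit integers and compares the two arrays position by position with no offset search. The function name is hypothetical, and a real matcher would also try different alignments.

```python
def fingerprint_similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching bits over the overlapping prefix (1.0 = identical)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # XOR marks differing bits; the mask keeps signed values within 32 bits.
    differing = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (32 * n)


print(fingerprint_similarity([0, 0b1111], [0, 0]))
```

Unrelated recordings tend to land near 0.5 with this measure, since about half the bits match by chance.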

De-reverb, de-echo, de-noise

AI audio restoration via UVR ecosystem models — BS-Roformer and MelBand Roformer. All three are unified under POST /v1/audio/restore/{engine}.

# Remove room reverb (BS-Roformer, SDR 19+)
curl -X POST http://localhost:8000/v1/audio/restore/uvr-dereverb \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","output_path":"out/dry.wav"}'

# Remove echo — normal mode
curl -X POST http://localhost:8000/v1/audio/restore/uvr-deecho \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","output_path":"out/noecho.wav"}'

# Remove echo — aggressive mode (same engine, harder suppression)
curl -X POST http://localhost:8000/v1/audio/restore/uvr-deecho \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","aggressive":true,"output_path":"out/noecho.wav"}'

# Remove broadband background noise — ML (MelBand Roformer, SDR 28)
curl -X POST http://localhost:8000/v1/audio/restore/uvr-denoise \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","output_path":"out/clean.wav"}'

All support output_format, output_path, output_url. For DSP-based noise reduction (no GPU), use /v1/audio/noise-reduce/noise-reduce.

UVR engines also work through /v1/audio/separate — uvr-vocal-bsr (BS-Roformer, SDR 13) and uvr-karaoke return vocal + instrumental stems like Demucs but often with higher quality.

Audio-to-MIDI transcription

Polyphonic audio-to-MIDI via Spotify's basic-pitch (ONNX backend, no TensorFlow). Play guitar, hum a melody, record a piano riff — get a MIDI file back with all the notes.

# Any audio → MIDI file (staged)
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/guitar_riff.wav","output_path":"out/riff.mid"}'
curl -o riff.mid http://localhost:8000/v1/files/out/riff.mid

# Tune the detection thresholds
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/piano.wav","onset_threshold":0.6,"frame_threshold":0.3,"minimum_note_length_ms":80,"output_path":"out/piano.mid"}'

# Write directly to a different staging path
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"recordings/bass.wav","output_path":"midi/bass_notes.mid"}'
# → {"path":"midi/bass_notes.mid","size":...,"engine":"basic-pitch","output_format":"mid"}

Optional params: onset_threshold (0–1, default 0.5), frame_threshold (0–1, default 0.3), minimum_note_length_ms (default 58), minimum_frequency / maximum_frequency (Hz, default unconstrained), multiple_pitch_bends (bool, default false), melodia_trick (bool, default true — helps with melodic content). Default engine: basic-pitch.
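
minimum_frequency and maximum_frequency are given in Hz, while instrument ranges are usually known as note names or MIDI numbers. The standard equal-temperament conversion is f = 440 * 2^((n - 69) / 12). The sketch below uses it to build a request body limited to a bass-guitar-like range. The note choices, file paths, and helper name are illustrative assumptions.

```python
def midi_to_hz(note: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency of a MIDI note number (69 = A4)."""
    return a4_hz * 2.0 ** ((note - 69) / 12)


# Hypothetical range: E1 (MIDI 28) up to G4 (MIDI 67).
body = {
    "file_path": "recordings/bass.wav",
    "minimum_frequency": round(midi_to_hz(28), 2),
    "maximum_frequency": round(midi_to_hz(67), 2),
    "output_path": "midi/bass_notes.mid",
}
print(body)
```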

The staged MIDI file can be passed straight to /v1/midi/inspect or /v1/midi/render, so audio → MIDI → audio is a complete round-trip.

Neural speech and vocal enhancement

DeepFilterNet DF3 — deep learning noise suppression trained on speech. Better than broadband de-noise for voice recordings; more surgical than UVR's de-noise on vocals specifically.

# Enhance a vocal recording
curl -X POST http://localhost:8000/v1/audio/enhance/deepfilter \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal_recording.wav","output_path":"out/enhanced.wav"}'
curl -o enhanced.wav http://localhost:8000/v1/files/out/enhanced.wav

# Stage the output as mp3
curl -X POST http://localhost:8000/v1/audio/enhance/deepfilter \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"vocals/raw.wav","output_format":"mp3","output_path":"vocals/enhanced.mp3"}'

Supports output_format, output_path, output_url.

Generate music + SFX

Text-to-audio generation under POST /v1/audio/generate/{engine}. v1.0.0 ships five engines spanning music + sound effects, with different licence / VRAM / sound profiles — all CUDA-only.

# Stable Audio Open 1.0 — 47s cap, no vocals, great for loops + SFX
curl -X POST http://localhost:8000/v1/audio/generate/stable-audio-open \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"130 bpm tech house drum loop, punchy kick, crisp hats, no vocals","duration_sec":10,"seed":42,"output_path":"out/loop.wav"}'
curl -o loop.wav http://localhost:8000/v1/files/out/loop.wav

# MusicGen 300M — 30s cap, instrumental, CC-BY-NC (opt-in required)
curl -X POST http://localhost:8000/v1/audio/generate/musicgen-small \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"lo-fi hip-hop beat with vinyl crackle, 90 bpm","duration_sec":15,"output_path":"out/beat.wav"}'

# Riffusion — spectrogram-to-audio via Griffin-Lim, ~5s, lo-fi character
curl -X POST http://localhost:8000/v1/audio/generate/riffusion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"ambient drone with metallic resonance","output_path":"out/drone.wav"}'

# AudioLDM 2 — general SFX (no opt-in gate, CC-BY 4.0 commercial-OK)
curl -X POST http://localhost:8000/v1/audio/generate/audioldm2 \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"heavy rain on a metal roof with distant thunder","duration_sec":10,"num_inference_steps":50,"output_path":"out/rain.wav"}'

Engine details:

Engine	Licence	Max length	VRAM (fp16)	Output
`stable-audio-open`	Stability Community Licence (commercial OK below the revenue threshold)	47 s hard cap	~12 GB	44.1 kHz stereo. Loops, SFX, ambient textures — instrumental only
`musicgen-small`	CC-BY-NC 4.0 (non-commercial only — opt-in via `AUDIOLLA_ENABLE_NONCOMMERCIAL=1`)	30 s hard cap	~3 GB	32 kHz mono. Meta MusicGen 300M, instrumental
`musicgen-medium`	CC-BY-NC 4.0 (same opt-in)	30 s hard cap	~6-8 GB	32 kHz mono. Higher quality than -small
`riffusion`	CreativeML OpenRAIL-M (commercial OK with the licence's usage restrictions)	~5 s per pass	~3 GB	22.05 kHz mono. SD-style spectrogram, Griffin-Lim reconstruction — lo-fi / loop-y character
`audioldm2`	CC-BY 4.0 (commercial use OK — no opt-in gate)	30 s hard cap	~8-10 GB (CPU offload)	16 kHz mono. General SFX: ambience, foley, animal, mechanical, impact sounds. Slow (200-step DDIM default; pass `num_inference_steps=50` for ~4x speedup)

All engines support async_job=true, webhook_url, output_path, output_url, and seed for reproducibility. stable-audio-open and audioldm2 additionally accept num_inference_steps (trade quality for speed). Model weights download on first call to HF_HOME (default /data/hf inside the container — ~7 GB across all five). Subsequent calls are inference-only. All five are flagged cuda_only — non-CUDA hosts get HTTP 400.

Licence opt-in for MusicGen. MusicGen weights are CC-BY-NC 4.0. The engine code ships with the image but refuses to load the model unless the operator explicitly sets AUDIOLLA_ENABLE_NONCOMMERCIAL=1 in the server's environment. matchering (GPL v3) follows the same pattern: licence-encumbered code ships in the image, and using it takes a conscious opt-in. Read the MusicGen weights licence before opting in. AudioLDM 2 is CC-BY 4.0 (commercial use allowed, no opt-in gate), which makes it the only generator in this set that's commercial-safe without flipping any flags.
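
The length caps and the opt-in gate can be checked client-side before a request is sent. The sketch below transcribes the limits from the engine table into a dict. The function name and messages are hypothetical, and the server's own validation and error format may differ.

```python
# Limits transcribed from the engine table. riffusion lists no hard cap
# (about 5 s per pass), so its duration is left unchecked.
ENGINE_LIMITS = {
    "stable-audio-open": {"max_sec": 47, "noncommercial": False},
    "musicgen-small": {"max_sec": 30, "noncommercial": True},
    "musicgen-medium": {"max_sec": 30, "noncommercial": True},
    "riffusion": {"max_sec": None, "noncommercial": False},
    "audioldm2": {"max_sec": 30, "noncommercial": False},
}


def preflight(engine, duration_sec=None, noncommercial_enabled=False):
    """Return a list of reasons the request would likely be rejected."""
    limits = ENGINE_LIMITS.get(engine)
    if limits is None:
        return [f"unknown engine: {engine}"]
    problems = []
    cap = limits["max_sec"]
    if cap is not None and duration_sec is not None and duration_sec > cap:
        problems.append(f"{engine} caps output at {cap} s, asked for {duration_sec} s")
    if limits["noncommercial"] and not noncommercial_enabled:
        problems.append(f"{engine} needs AUDIOLLA_ENABLE_NONCOMMERCIAL=1 on the server")
    return problems


print(preflight("musicgen-small", duration_sec=15))
```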

Deferred to a future release (researched but not shipped in v1.0.0):

ACE-Step v1 (3.5B, Apache 2.0, full songs with vocals up to 4 min) — requires AceStepPipeline from diffusers>=0.38, which itself requires a pre-release safetensors. Doesn't pass the project's hash-locked supply-chain gate. Revisit when safetensors 0.8.x ships stable, or vendor ACE-Step's pipeline directly.
DiffRhythm full v1.2 (Apache 2.0) — unpackaged research repo (no setup.py / PyPI release). Revisit when upstream ships a package or we vendor under thirdparty/.
Stable Audio Open Small (Stability Community Licence, 11 s SFX-specialist) — requires stable-audio-tools which pins python >=3.10, <3.11; audiolla is on Python 3.12, hard incompatibility. Revisit when stable-audio-tools widens the Python constraint or diffusers grows a pipeline for it.
TangoFlux (ICLR 2026, 44.1 kHz, 30 s, fast) — git-only install (no PyPI package). Could be SHA-pinned in the hash-locked supply chain; deferred for now to keep the heavy-deps stack PyPI-only.
AudioGen (Meta, CC-BY-NC) — audiocraft==1.3.0 pins transformers<=4.31.0, hard conflict with audiolla's 4.51.3. Would require an isolated subprocess / sidecar container.
YuE 7B (Apache 2.0, full songs with vocals) — needs 16-24 GB VRAM at fp16, doesn't fit 12 GB GPUs without int4 quant tooling.

Chord and key detection

Krumhansl-Schmuckler key estimation + chroma-template chord segmentation via librosa. No extra deps beyond the librosa stack.

curl -X POST http://localhost:8000/v1/audio/chords \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'
# → {
#     "key": "C major",
#     "key_confidence": 0.87,
#     "duration": 183.4,
#     "chords": [
#       {"chord": "C", "start_sec": 0.0, "end_sec": 2.3, "confidence": 0.91},
#       {"chord": "Am", "start_sec": 2.3, "end_sec": 4.6, "confidence": 0.85},
#       ...
#     ]
#   }

# Tune the hop length (lower = finer time resolution)
curl -X POST http://localhost:8000/v1/audio/chords \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","hop_length":256}'

Optional params: hop_length (default 512), segment_min_duration_sec (default 0.5 — merge very short chord segments).
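
For readers unfamiliar with Krumhansl-Schmuckler: the idea is to correlate a 12-bin pitch-class (chroma) vector against a major and a minor key profile rotated to each of the 12 tonics, and pick the best match. The sketch below shows that idea in plain Python using the published Krumhansl-Kessler profile values. It is not audiolla's implementation, and the score it returns is a raw correlation, which is not necessarily how key_confidence is computed.

```python
import math

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR_PROFILE = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def estimate_key(chroma):
    """Return (key_name, correlation) for a 12-bin chroma vector starting at C."""
    best_score, best_name = -2.0, ""
    for tonic in range(12):
        for mode, profile in (("major", MAJOR_PROFILE), ("minor", MINOR_PROFILE)):
            # Rotate so the profile's tonic weight lands on this pitch class.
            rotated = [profile[(i - tonic) % 12] for i in range(12)]
            score = pearson(chroma, rotated)
            if score > best_score:
                best_score, best_name = score, f"{NOTES[tonic]} {mode}"
    return best_name, best_score


# Chroma shaped like the minor profile with its tonic on F# (pitch class 6).
print(estimate_key([MINOR_PROFILE[(i - 6) % 12] for i in range(12)])[0])
```

In practice the chroma vector would be the time-average of a chromagram computed from the audio.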

Voice activity detection

silero-vad — ONNX-based VAD, fast and accurate on both speech and music. Returns timestamped speech and non-speech segments.

curl -X POST http://localhost:8000/v1/audio/vad \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/interview.wav"}'
# → {
#     "speech_ratio": 0.73,
#     "duration": 120.0,
#     "threshold": 0.5,
#     "speech_segments": [
#       {"start_sec": 1.2, "end_sec": 8.4},
#       ...
#     ],
#     "non_speech_segments": [
#       {"start_sec": 0.0, "end_sec": 1.2},
#       ...
#     ]
#   }

# Tighter detection
curl -X POST http://localhost:8000/v1/audio/vad \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/podcast.wav","threshold":0.7,"min_speech_duration_ms":300,"min_silence_duration_ms":200}'

Optional params: threshold (0–1, default 0.5), min_speech_duration_ms (default 250), min_silence_duration_ms (default 100).
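
The relationship between the response fields can be reproduced client-side, which is handy for checking a result or re-deriving gaps after filtering segments. The sketch below assumes speech_ratio is total speech time divided by duration and that non-speech segments are the gaps between speech segments. The helper names are hypothetical, and the server may compute these differently.

```python
def speech_ratio(speech_segments, duration):
    """Total speech time divided by the file duration."""
    spoken = sum(s["end_sec"] - s["start_sec"] for s in speech_segments)
    return spoken / duration if duration > 0 else 0.0


def invert_segments(speech_segments, duration):
    """Gaps between speech segments, including leading and trailing silence."""
    gaps, cursor = [], 0.0
    for seg in sorted(speech_segments, key=lambda s: s["start_sec"]):
        if seg["start_sec"] > cursor:
            gaps.append({"start_sec": cursor, "end_sec": seg["start_sec"]})
        cursor = max(cursor, seg["end_sec"])
    if cursor < duration:
        gaps.append({"start_sec": cursor, "end_sec": duration})
    return gaps


segments = [{"start_sec": 1.0, "end_sec": 4.0}, {"start_sec": 6.0, "end_sec": 8.0}]
print(speech_ratio(segments, 10.0), invert_segments(segments, 10.0))
```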

Speaker diarization

pyannote/speaker-diarization-3.1 — state-of-the-art speaker diarization from HuggingFace Hub. Returns per-speaker timestamped segments and speaker count.

Note: This engine requires a HuggingFace account. You must accept the model terms at https://huggingface.co/pyannote/speaker-diarization-3.1 and then set HF_TOKEN (or the older alias HUGGINGFACE_TOKEN — the entrypoint mirrors them both ways) when starting the container. A read-only token with model access is enough. The same token also unlocks the gated text-to-audio engines (stable-audio-open, musicgen-small, musicgen-medium) provided you've accepted their licences on huggingface.co.

docker run ... \
  -e HF_TOKEN=hf_your_token_here \
  psyb0t/audiolla:latest

curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/interview.wav"}'
# → {
#     "num_speakers": 2,
#     "speakers": ["SPEAKER_00", "SPEAKER_01"],
#     "duration": 120.0,
#     "segments": [
#       {"speaker": "SPEAKER_00", "start_sec": 0.5, "end_sec": 8.2, "duration_sec": 7.7},
#       {"speaker": "SPEAKER_01", "start_sec": 8.5, "end_sec": 14.1, "duration_sec": 5.6},
#       ...
#     ]
#   }

# Hint the expected speaker count
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/roundtable.wav","num_speakers":4}'

# Or constrain the range
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/panel.wav","min_speakers":2,"max_speakers":6}'

Optional params: num_speakers (exact count hint), min_speakers, max_speakers.
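
A typical next step with the diarization response is totalling talk time per speaker. The sketch below uses the segment fields shown in the sample response above (speaker, start_sec, end_sec). The helper name and the sample data are hypothetical.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum segment lengths per speaker label, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end_sec"] - seg["start_sec"]
    return dict(totals)


sample = [
    {"speaker": "SPEAKER_00", "start_sec": 0.5, "end_sec": 8.0},
    {"speaker": "SPEAKER_01", "start_sec": 8.5, "end_sec": 14.0},
    {"speaker": "SPEAKER_00", "start_sec": 14.0, "end_sec": 16.0},
]
print(talk_time(sample))
```

Overlapping speech is counted once per speaker here, so the totals can add up to more than the file duration.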

Transform

# pitch shift up 2 semitones + add reverb, export mp3.
# operations is a JSON array — ops: gain, equalizer, compand, reverb,
# pitch, tempo, rate, channels, trim, pad.
curl -X POST http://localhost:8000/v1/audio/transform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","operations":[{"op":"pitch","params":{"n_semitones":2}},{"op":"reverb","params":{"reverberance":50}}],"output_format":"mp3","output_path":"out/out.mp3"}'
curl -o out.mp3 http://localhost:8000/v1/files/out/out.mp3

Loudness measurement

# Measure integrated LUFS — returns JSON, no audio output
curl -X POST http://localhost:8000/v1/audio/loudness \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'
# → {"loudness_lufs": -18.4}

Loudness curve

RMS envelope over time — returns a list of {time_sec, rms_db} points. Useful for generating gain automation curves, finding loud and quiet sections, or visualising dynamic range before mastering.

# Default hop (512 samples) — fine-grained envelope
curl -X POST http://localhost:8000/v1/audio/loudness/curve \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}' | jq '.curve[:5]'
# → [
#     {"time_sec": 0.0,   "rms_db": -18.4},
#     {"time_sec": 0.012, "rms_db": -17.9},
#     ...
#   ]

# Coarser envelope (2048-sample hop)
curl -X POST http://localhost:8000/v1/audio/loudness/curve \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","hop_length":2048}' | jq '{duration, sample_rate, points}'

Response fields: curve (array of {time_sec, rms_db}), duration (seconds), sample_rate, points (total curve length). Optional param: hop_length (default 512).
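
How hop_length maps to the time axis: consecutive curve points are one hop apart, so the spacing in seconds is hop_length / sample_rate. A small worked sketch follows, assuming a 44.1 kHz file, which is consistent with the 0.012 s step in the sample output above. The helper names are hypothetical.

```python
def curve_step_sec(hop_length: int, sample_rate: int) -> float:
    """Seconds between consecutive curve points."""
    return hop_length / sample_rate


def points_per_second(hop_length: int, sample_rate: int) -> float:
    """How many curve points describe one second of audio."""
    return sample_rate / hop_length


print(round(curve_step_sec(512, 44100), 4))   # default hop
print(round(curve_step_sec(2048, 44100), 4))  # coarser hop, 4x the spacing
```

Quadrupling the hop cuts the number of points to a quarter, which is usually plenty for gain automation.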

Loudness normalization

# Normalize to -14 LUFS (a common streaming-platform target)
curl -X POST http://localhost:8000/v1/audio/normalize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","target_lufs":-14,"output_path":"out/normalized.wav"}'
curl -o normalized.wav http://localhost:8000/v1/files/out/normalized.wav

# Write to a different staging path
curl -X POST http://localhost:8000/v1/audio/normalize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","target_lufs":-23,"output_path":"mastered/norm.wav"}'

target_lufs is required. The response JSON carries loudness_lufs with the measured pre-normalization level alongside path / url / size.
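
The loudness_lufs value in the response tells you how much gain the normalization implies: the difference between target and measured level in dB, which converts to a linear factor via 10^(dB/20). The sketch below shows that arithmetic only. Whether the server applies plain static gain or also limits peaks is not stated here, and the helper name is hypothetical.

```python
def gain_to_target(measured_lufs: float, target_lufs: float) -> tuple[float, float]:
    """Return (gain in dB, linear amplitude factor) needed to hit the target."""
    gain_db = target_lufs - measured_lufs
    return gain_db, 10 ** (gain_db / 20)


gain_db, factor = gain_to_target(measured_lufs=-18.0, target_lufs=-14.0)
print(f"{gain_db:+.1f} dB, x{factor:.3f}")
```

A positive gain this large can push peaks past 0 dBFS, which is worth checking with the clip-detect endpoint afterwards.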

HPSS (harmonic/percussive split)

Median-filter harmonic/percussive source separation via librosa. Harmonic = tonal content (pitched instruments, pads); percussive = transients (drums, percussion). No ML — pure DSP, fast, no GPU needed.

# Get both stems in a ZIP
curl -X POST http://localhost:8000/v1/audio/separate/hpss \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","output_path":"out/stems.zip"}'
curl -o stems.zip http://localhost:8000/v1/files/out/stems.zip
# → stems.zip contains harmonic.wav + percussive.wav

# Wider margin = harder separation (more aggressive)
curl -X POST http://localhost:8000/v1/audio/separate/hpss \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","margin":3.0,"output_path":"out/stems.zip"}'

# Output to a different staging path
curl -X POST http://localhost:8000/v1/audio/separate/hpss \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","output_path":"hpss/stems.zip"}'

Params: margin (default 1.0 — ≥1.0, higher = more aggressive), kernel_size (default 31 — odd int, median filter width), output_format (default wav).

Spectral noise reduction

Noise reduction with two engine options under the same endpoint — pick DSP for no-GPU fast cleanup or ML for higher-quality removal.

# DSP (noisereduce): no GPU, spectral gating
curl -X POST http://localhost:8000/v1/audio/noise-reduce/noise-reduce \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","output_path":"out/clean.wav"}'

# Stationary mode — constant hum, hiss, fan noise
curl -X POST http://localhost:8000/v1/audio/noise-reduce/noise-reduce \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","stationary":true,"output_path":"out/clean.wav"}'

# Partial reduction — subtle noise floor cleanup
curl -X POST http://localhost:8000/v1/audio/noise-reduce/noise-reduce \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","prop_decrease":0.5,"output_path":"out/clean.wav"}'

# ML (UVR MelBand Roformer, SDR 28) — higher quality, GPU-accelerated
curl -X POST http://localhost:8000/v1/audio/noise-reduce/uvr-denoise \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","output_path":"out/clean.wav"}'

DSP params (only apply to noise-reduce engine): stationary (bool, default false), prop_decrease (0–1, default 1.0). Both engines accept output_format, output_path, output_url.

Time-stretch and pitch-shift

Independent tempo factor and semitone offset via librosa phase vocoder. Slow a track down to learn it; shift a vocal up 3 semitones for a different key; transpose a MIDI melody to a different register first, then render.

# Slow down to 80% speed, no pitch change
curl -X POST http://localhost:8000/v1/audio/stretch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","tempo_factor":0.8,"output_path":"out/slow.wav"}'

# Shift up 3 semitones, no tempo change
curl -X POST http://localhost:8000/v1/audio/stretch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","pitch_semitones":3,"output_path":"out/pitched.wav"}'

# Both: half speed, pitch shifted up 6 semitones, MP3 output
curl -X POST http://localhost:8000/v1/audio/stretch \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","tempo_factor":0.5,"pitch_semitones":6,"output_format":"mp3","output_path":"out/stretched.mp3"}'

Params: tempo_factor (default 1.0 — 0.5 = half speed), pitch_semitones (default 0.0 — ±semitones), output_format, output_path.
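To sanity-check values before sending a request, the two parameters map to simple ratios. This sketch assumes equal temperament for pitch_semitones and that tempo_factor scales playback rate (0.5 = half speed, as stated above); both helper names are illustrative only.

```python
def pitch_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of N equal-tempered semitones."""
    return 2 ** (semitones / 12)


def stretched_duration(duration_sec: float, tempo_factor: float) -> float:
    """Output length after changing tempo: slower playback means a longer file."""
    if tempo_factor <= 0:
        raise ValueError("tempo_factor must be positive")
    return duration_sec / tempo_factor


print(round(pitch_ratio(3), 4))                  # ratio for +3 semitones
print(round(stretched_duration(60.0, 0.8), 2))   # a 60 s clip at 80% speed
```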

Pitch correct

Auto-tune audio toward the nearest chromatic semitone using librosa's phase vocoder. Full strength=1.0 snaps hard to pitch; lower values blend the corrected and original signal.

# Hard auto-tune — snap every note to the nearest semitone
curl -X POST http://localhost:8000/v1/audio/pitch-correct \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","output_path":"out/tuned.wav"}'

# Subtle correction — 50% blend
curl -X POST http://localhost:8000/v1/audio/pitch-correct \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","strength":0.5,"output_format":"mp3","output_path":"out/tuned.mp3"}'

# Async for long files, staged output
curl -X POST http://localhost:8000/v1/audio/pitch-correct \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"sessions/take1.wav","strength":1.0,"async_job":true,"output_path":"sessions/take1_tuned.wav"}'

Params: strength (0.0–1.0, default 1.0), output_format, output_path, async_job, webhook_url. Requires librosa-analyze engine.

Repair

Declip clipped peaks and/or remove power-line hum. Declipping uses cubic interpolation to reconstruct flattened waveform tops and bottoms. Dehumming applies a notch filter at hum_freq (and harmonics).

# Declip only (default)
curl -X POST http://localhost:8000/v1/audio/repair \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/overdriven.wav","output_path":"out/repaired.wav"}'

# Remove 60 Hz hum (North American power grid)
curl -X POST http://localhost:8000/v1/audio/repair \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","declip":false,"dehum":true,"hum_freq":60.0,"output_path":"out/clean.wav"}'

# Both — declip a 50 Hz humming mic recording
curl -X POST http://localhost:8000/v1/audio/repair \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/problem_track.wav","declip":true,"dehum":true,"hum_freq":50.0,"output_format":"flac","output_path":"out/repaired.flac"}'

Params: declip (bool, default true), dehum (bool, default false), hum_freq (Hz, default 50.0), output_format, output_path, async_job, webhook_url.

Audio tagging

Top-K AudioSet class label classification via Audio Spectrogram Transformer (MIT/ast-finetuned-audioset-10-10-0.4593). Identifies what's in a recording — music, speech, specific instruments, environmental sounds, etc.

curl -X POST http://localhost:8000/v1/audio/tag \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav"}'
# → {
#     "tags": [
#       {"label": "Music", "score": 0.94},
#       {"label": "Drum", "score": 0.87},
#       {"label": "Guitar", "score": 0.71},
#       ...
#     ],
#     "duration": 5.2
#   }

# Get top 20 results instead of the default 10
curl -X POST http://localhost:8000/v1/audio/tag \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/soundscape.wav","top_k":20}'

Optional: top_k (default 10). Requires the HF model cache: the image defaults to HF_HUB_OFFLINE=0, so the first call lazy-downloads the weights into /data/hf/. For locked-down deployments (no egress), prefetch the model with huggingface-cli download <model> into a mounted /data/hf volume, then start the container with -e HF_HUB_OFFLINE=1.

Audio embeddings

512-dimensional L2-normalized audio embeddings via LAION CLAP (laion/larger_clap_music_and_speech). Useful for semantic audio search, similarity scoring, and clustering.

# Get the embedding vector
curl -X POST http://localhost:8000/v1/audio/embed \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'
# → {"embedding": [0.032, -0.11, ...], "dim": 512, "norm": 1.0}

# Semantic similarity — how well does the audio match a text description?
curl -X POST http://localhost:8000/v1/audio/embed \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","query_text":"energetic rock guitar riff"}'
# → {"embedding": [...], "dim": 512, "norm": 1.0,
#    "query_text": "energetic rock guitar riff", "similarity": 0.73}

similarity is cosine similarity in [-1, 1]. Requires HF model cache — same first-run download caveat as audio tagging.

Zero-shot classification

Given audio and a list of free-form text labels, return cosine similarity scores for each using the existing CLAP model. No extra model download — uses the same clap-embed engine. Works for genres, moods, instruments, sonic descriptors — anything CLAP understands.

# Genre detection
curl -X POST http://localhost:8000/v1/audio/classify \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","labels":["jazz","hip-hop","classical","electronic","rock"]}'
# → {"results": [
#     {"label": "hip-hop", "score": 0.42},
#     {"label": "electronic", "score": 0.38},
#     ...
#   ]}

# Mood / energy
curl -X POST http://localhost:8000/v1/audio/classify \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","labels":["energetic","calm","melancholic","aggressive","uplifting"]}'

# Speaker gender
curl -X POST http://localhost:8000/v1/audio/classify \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/interview.wav","labels":["male voice","female voice","child voice","multiple speakers"]}'

Results are sorted by descending score. Scores are cosine similarities in [-1, 1] — higher = more similar. Requires clap-embed model cache.
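The ranking step described above can be sketched in a few lines. The vectors here are toy 3-dimensional stand-ins for real 512-dimensional CLAP embeddings, and cosine / rank_labels are hypothetical helpers that only mirror the documented response shape (label plus score, sorted descending).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(audio_vec, label_vecs):
    """Score every label embedding against the audio embedding, best first."""
    scored = [{"label": label, "score": cosine(audio_vec, vec)}
              for label, vec in label_vecs.items()]
    return sorted(scored, key=lambda r: r["score"], reverse=True)


# Toy embeddings (assumption: real ones come from the CLAP model).
audio = [0.6, 0.8, 0.0]
labels = {"rock": [1.0, 0.0, 0.0], "jazz": [0.0, 1.0, 0.0], "classical": [0.0, 0.0, 1.0]}
for row in rank_labels(audio, labels):
    print(row["label"], round(row["score"], 2))
```

Because the scores are raw cosine similarities rather than probabilities, they do not sum to 1; compare them against each other rather than against a fixed cutoff.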

Audio info

Probe any audio file for metadata without decoding it. Uses ffprobe, so it handles any format ffmpeg can read.

curl -X POST http://localhost:8000/v1/audio/info \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}'
# → {
#     "size_bytes": 52428800,
#     "duration_sec": 297.241,
#     "sample_rate": 44100,
#     "channels": 2,
#     "codec": "pcm_s16le",
#     "sample_fmt": "s16",
#     "format": "wav",
#     "bit_depth": 16,
#     "bit_rate": 1411200
#   }

# Works on any staged file
curl -X POST http://localhost:8000/v1/audio/info \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"recordings/interview.mp3"}'
# → {"codec": "mp3", "bit_rate": 192000, ...}

Trim

Cut a precise time range out of any audio file. Common use: extract a chorus, clip a sample, chop a stem at bar boundaries.

# Extract seconds 30–90 from a track
curl -X POST http://localhost:8000/v1/audio/trim \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","start_sec":30.0,"end_sec":90.0,"output_path":"out/chorus.wav"}'

# Clip the first 8 seconds, export as mp3
curl -X POST http://localhost:8000/v1/audio/trim \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/stem.wav","start_sec":0.0,"end_sec":8.0,"output_format":"mp3","output_path":"out/loop.mp3"}'

# From staged file, write to a different staging path
curl -X POST http://localhost:8000/v1/audio/trim \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"sessions/full.wav","start_sec":120.5,"end_sec":180.0,"output_path":"clips/verse.wav"}'

start_sec defaults to 0. end_sec is required and must be greater than start_sec. Supports all standard output_format values.

Mix

Combine multiple staged or URL-accessible tracks into one. Per-track gain_db lets you balance levels before mixing. Useful for bouncing separated stems back together at custom levels, layering synth parts, or combining click-track + music.

# Mix drums and bass at equal levels
curl -X POST http://localhost:8000/v1/audio/mix \
  -H 'Content-Type: application/json' \
  -d '{"tracks":[{"file_path":"stems/drums.wav"},{"file_path":"stems/bass.wav"}],"output_path":"out/rhythm.wav"}'

# Stems at custom levels (drums -3 dB, bass 0 dB, vocals +2 dB)
curl -X POST http://localhost:8000/v1/audio/mix \
  -H 'Content-Type: application/json' \
  -d '{"tracks":[
    {"file_path":"stems/drums.wav","gain_db":-3},
    {"file_path":"stems/bass.wav","gain_db":0},
    {"file_path":"stems/vocals.wav","gain_db":2}
  ],"output_format":"wav","output_path":"out/custom_mix.wav"}'

# Write to a different staging path
curl -X POST http://localhost:8000/v1/audio/mix \
  -H 'Content-Type: application/json' \
  -d '{"tracks":[{"file_path":"stems/harmonic.wav"},{"file_path":"stems/percussive.wav","gain_db":-6}],"output_path":"mixed/recombined.wav"}'

tracks is a required JSON array. Each entry needs file_path or file_url and an optional gain_db (default 0.0). Requires at least 2 tracks. Shorter tracks are padded with silence to match the longest.
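A minimal sketch of what per-track gain and silence padding mean numerically. It assumes a plain sample-wise sum of mono float samples; db_to_linear and mix are hypothetical helpers, and whether the server limits or normalizes the summed signal is not stated here.

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to an amplitude multiplier."""
    return 10 ** (gain_db / 20)


def mix(tracks):
    """Sum (samples, gain_db) pairs; shorter tracks count as silence past their end."""
    length = max(len(samples) for samples, _ in tracks)
    out = [0.0] * length
    for samples, gain_db in tracks:
        gain = db_to_linear(gain_db)
        for i, x in enumerate(samples):
            out[i] += gain * x
    return out


drums = ([0.5, 0.5, 0.5, 0.5], 0.0)   # full length, unity gain
vocal = ([0.2, 0.2], -6.0)            # shorter track, attenuated by 6 dB
print([round(x, 3) for x in mix([drums, vocal])])
```

Summing several hot stems can exceed full scale, so negative gain_db values are a cheap way to leave headroom.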

Concat

Stitch N audio files together in order. Handles different sample rates and channel counts automatically (ffmpeg resamples on the fly).

curl -X POST http://localhost:8000/v1/audio/concat \
  -H 'Content-Type: application/json' \
  -d '{"files":[{"file_path":"intro.wav"},{"file_path":"verse.wav"},{"file_path":"outro.wav"}],"output_path":"out/full_track.wav"}'

# output_format change + different staging path
curl -X POST http://localhost:8000/v1/audio/concat \
  -H 'Content-Type: application/json' \
  -d '{"files":[{"file_path":"a.wav"},{"file_path":"b.wav"}],"output_format":"mp3","output_path":"concat/result.mp3"}'

files is a required JSON array of {file_path?, file_url?} objects. Requires at least 2 entries.

Speed

Change playback speed without pitch shifting — useful for auditioning at half/double speed, or creating slow-motion effects. Uses ffmpeg atempo filter chained for extreme multipliers.

# Half speed
curl -X POST http://localhost:8000/v1/audio/speed \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","speed":0.5,"output_path":"out/slow.wav"}'

# Double speed
curl -X POST http://localhost:8000/v1/audio/speed \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","speed":2.0,"output_path":"out/fast.wav"}'

# 4× speed (chains two atempo=2.0 filters internally)
curl -X POST http://localhost:8000/v1/audio/speed \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","speed":4.0,"output_format":"mp3","output_path":"out/fast.mp3"}'

speed is required. Range: 0.1–10.0. Note: this changes duration but not pitch (atempo is pitch-preserving). To change tempo and pitch independently in one call, use /v1/audio/stretch.
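The chaining mentioned in the 4× example can be sketched as follows. This assumes each atempo instance is kept within 0.5–2.0, the range a single instance has traditionally been limited to; atempo_chain and filter_string are hypothetical helpers, not audiolla's actual implementation.

```python
def atempo_chain(speed: float) -> list[float]:
    """Split a speed multiplier into factors that each fit in [0.5, 2.0]."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    factors = []
    while speed > 2.0:       # peel off doublings for fast speeds
        factors.append(2.0)
        speed /= 2.0
    while speed < 0.5:       # peel off halvings for slow speeds
        factors.append(0.5)
        speed /= 0.5
    factors.append(speed)    # remainder is now within range
    return factors


def filter_string(speed: float) -> str:
    """Build an ffmpeg -filter:a value from the chain."""
    return ",".join(f"atempo={f:g}" for f in atempo_chain(speed))


print(filter_string(4.0))
print(filter_string(0.1))
```

The product of the factors always equals the requested speed, so the chain changes duration by exactly the amount a single filter would.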

Convert

Re-encode audio to a different format, sample rate, or channel count in a single call.

# WAV → 16 kHz mono FLAC (for speech models)
curl -X POST http://localhost:8000/v1/audio/convert \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","output_format":"flac","sample_rate":16000,"channels":1,"output_path":"out/prepared.flac"}'

# Stereo → mono WAV
curl -X POST http://localhost:8000/v1/audio/convert \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"stereo.wav","channels":1,"output_path":"out/mono.wav"}'

# Any format → Opus at 48 kHz
curl -X POST http://localhost:8000/v1/audio/convert \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/audio.mp3","output_format":"opus","sample_rate":48000,"output_path":"out/out.opus"}'

output_format defaults to wav. sample_rate and channels are optional; if omitted, the source values are preserved.

Similar

Compute cosine similarity between two audio files using CLAP embeddings. Returns a score in [-1, 1] — 1 = identical sound, 0 = unrelated, negative = acoustically opposite. Useful for duplicate detection, cover matching, or finding the closest sample in a library.

curl -X POST http://localhost:8000/v1/audio/similar \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/original.wav","reference_file_path":"uploads/remix.wav"}'
# → {"similarity": 0.847, "dim": 512}

# Different staged paths
curl -X POST http://localhost:8000/v1/audio/similar \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"stems/vocals.wav","reference_file_path":"stems/vocals_ref.wav"}'

Primary file: file_path / file_url. Reference file: reference_file_path / reference_file_url. Requires clap-embed engine.

MIDI quantize

Snap all note timings in a MIDI file to the nearest rhythmic grid. A dedicated, simpler alternative to the quantize_grid_beats param of /v1/midi/transform.

# Quantize to 16th notes (0.25 beats)
curl -X POST http://localhost:8000/v1/midi/quantize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/sloppy.mid","grid_beats":0.25,"output_path":"out/tight.mid"}'

# 8th note grid
curl -X POST http://localhost:8000/v1/midi/quantize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"recorded.mid","grid_beats":0.5,"output_path":"midi/quantized.mid"}'

grid_beats: grid size in beats — 0.25 = 16th note, 0.5 = 8th, 1.0 = quarter note. Default: 0.25.
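Grid snapping amounts to rounding each note start to the nearest multiple of grid_beats. The sketch below shows that for start times only; quantize_beats is a hypothetical helper, and whether the server also snaps note durations is not covered here.

```python
def quantize_beats(start_beats: float, grid_beats: float = 0.25) -> float:
    """Snap a position in beats to the nearest grid line."""
    if grid_beats <= 0:
        raise ValueError("grid_beats must be positive")
    # Python's round() sends exact halfway cases to the even grid index.
    return round(start_beats / grid_beats) * grid_beats


sloppy = [0.02, 0.27, 0.49, 1.07]
print([quantize_beats(t) for t in sloppy])       # 16th-note grid
print([quantize_beats(t, 0.5) for t in sloppy])  # 8th-note grid
```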

Fade

Apply fade-in, fade-out, or both. 13 curve shapes: tri, qsin, esin, hsin, log, ipar, qua, cub, squ, cbr, par, exp, lin.

# 2s fade-in
curl -X POST http://localhost:8000/v1/audio/fade \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","fade_in":2.0,"output_path":"out/faded.wav"}'

# 3s fade-out with exponential curve
curl -X POST http://localhost:8000/v1/audio/fade \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","fade_out":3.0,"curve":"exp","output_path":"out/faded.wav"}'

# Both — 1s in, 2s out
curl -X POST http://localhost:8000/v1/audio/fade \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","fade_in":1.0,"fade_out":2.0,"output_path":"out/faded.wav"}'

At least one of fade_in / fade_out must be > 0.

Reverse

Flip audio backwards via ffmpeg areverse.

curl -X POST http://localhost:8000/v1/audio/reverse \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/sample.wav","output_path":"out/reversed.wav"}'

curl -X POST http://localhost:8000/v1/audio/reverse \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"stems/vocals.wav","output_format":"mp3","output_path":"out/reversed.mp3"}'

Loop

Repeat audio N times. Uses ffmpeg aloop filter — no re-encoding overhead per iteration.

# Play 4 times total
curl -X POST http://localhost:8000/v1/audio/loop \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/beat.wav","count":4,"output_path":"out/looped.wav"}'

# 8-bar loop → 32 bars
curl -X POST http://localhost:8000/v1/audio/loop \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"stems/drums.wav","count":4,"output_path":"loops/drums32.wav"}'

count must be ≥ 2 (total plays, not extra loops).

BPM match

Detect the source BPM via librosa, then time-stretch to the target — no manual math.

# Stretch anything to 128 BPM
curl -X POST http://localhost:8000/v1/audio/bpm-match \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/loop.wav","target_bpm":128,"output_path":"out/matched.wav"}'

# Match tempo and also shift pitch
curl -X POST http://localhost:8000/v1/audio/bpm-match \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/loop.wav","target_bpm":140,"pitch_semitones":2,"output_path":"out/matched.wav"}'

Response JSON includes source_bpm, target_bpm, and tempo_factor alongside the staged path / url. Requires both librosa-analyze and stretch engines.
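The "manual math" this endpoint saves is one division. The sketch assumes tempo_factor follows the same convention as /v1/audio/stretch (values above 1.0 speed the audio up); bpm_tempo_factor is a hypothetical helper.

```python
def bpm_tempo_factor(source_bpm: float, target_bpm: float) -> float:
    """Playback-rate multiplier that turns source_bpm into target_bpm."""
    if source_bpm <= 0 or target_bpm <= 0:
        raise ValueError("BPM values must be positive")
    return target_bpm / source_bpm


factor = bpm_tempo_factor(source_bpm=100.0, target_bpm=128.0)
print(factor)         # rate multiplier
print(8.0 / factor)   # new length of an 8-second loop, in seconds
```

Beat trackers sometimes report half or double the musical tempo, so it is worth checking source_bpm in the response when a result sounds twice as fast or slow as expected.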

Stereo width

Widen or collapse the stereo image via M/S processing. width=0.0 → mono, 1.0 → original, >1.0 → wider. Works on mono input too (upmixes first).

# Widen to 1.5×
curl -X POST http://localhost:8000/v1/audio/stereo-width \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/mix.wav","width":1.5,"output_path":"out/wide.wav"}'

# Collapse to mono
curl -X POST http://localhost:8000/v1/audio/stereo-width \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/mix.wav","width":0.0,"output_path":"out/mono.wav"}'

# Subtle narrowing for mix bus
curl -X POST http://localhost:8000/v1/audio/stereo-width \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"master/mix.wav","width":0.8,"output_path":"master/narrow.wav"}'

Range: [0.0, 3.0].
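A minimal sketch of M/S width scaling, assuming the common definition mid = (L + R) / 2 and side = (L − R) / 2 with width applied to the side signal only. set_width is a hypothetical helper; the server's exact scaling and any gain compensation may differ.

```python
def set_width(left, right, width):
    """Scale the side (L-R) component of a stereo pair by `width`."""
    out_l, out_r = [], []
    for l, r in zip(left, right):
        mid = (l + r) / 2
        side = (l - r) / 2
        out_l.append(mid + width * side)
        out_r.append(mid - width * side)
    return out_l, out_r


left, right = [0.5, 0.25], [-0.25, 0.25]
print(set_width(left, right, 1.0))  # unchanged
print(set_width(left, right, 0.0))  # both channels collapse to mid
print(set_width(left, right, 2.0))  # side doubled
```

Widths above 1.0 can push samples past full scale when the side signal is strong, so it is worth checking peaks after widening.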

Split

Split a file into segments. Two modes: equal (N equal time parts) or silence (split on quiet gaps). Returns a ZIP of numbered files.

# Split into 4 equal parts
curl -X POST http://localhost:8000/v1/audio/split \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","mode":"equal","count":4,"output_path":"out/segments.zip"}'

# Split a DJ mix on silence
curl -X POST http://localhost:8000/v1/audio/split \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/djmix.wav","mode":"silence","threshold_db":-40,"min_duration_sec":1.0,"output_path":"out/tracks.zip"}'

# Split to mp3
curl -X POST http://localhost:8000/v1/audio/split \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/album.flac","mode":"equal","count":10,"output_format":"mp3","output_path":"out/parts.zip"}'

mode=equal requires count >= 2. mode=silence uses threshold_db (default -30) and min_duration_sec (default 0.5); requires the silence-detect engine.

Pan

Position audio in the stereo field. Works on mono and stereo input.

# Hard left
curl -X POST http://localhost:8000/v1/audio/pan \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","position":-1.0,"output_path":"out/left.wav"}'

# Slight right (e.g. guitar in mix)
curl -X POST http://localhost:8000/v1/audio/pan \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"stems/guitar.wav","position":0.4,"output_path":"out/guitar_panned.wav"}'

# Center (no-op but valid)
curl -X POST http://localhost:8000/v1/audio/pan \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/mono.wav","position":0.0,"output_path":"out/stereo.wav"}'

position: -1.0 = hard left, 0.0 = center, 1.0 = hard right.

EQ

Parametric EQ via ffmpeg equalizer filter. Pass any number of bands — each with a center frequency, gain, and optional bandwidth.

# Low-end dip at 100 Hz + presence boost at 3 kHz
curl -X POST http://localhost:8000/v1/audio/eq \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","bands":[{"freq":100,"gain_db":-6,"width_hz":80},{"freq":3000,"gain_db":3,"width_hz":500}],"output_path":"out/eq.wav"}'

# Single band: cut 60 Hz hum
curl -X POST http://localhost:8000/v1/audio/eq \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/recording.wav","bands":[{"freq":60,"gain_db":-20,"width_hz":30}],"output_path":"out/clean.wav"}'

Each band: freq (Hz, required), gain_db (dB, required, range ±30), width_hz (optional, default 100).

Key match

Detect the source key via chord analysis (the chord-detect engine), then pitch-shift to a target key: one call instead of two.

# Shift everything to a C root
curl -X POST http://localhost:8000/v1/audio/key-match \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/loop.wav","target_key":"C","output_path":"out/matched.wav"}'

# Match to F# (response includes source_key + semitones shifted)
curl -X POST http://localhost:8000/v1/audio/key-match \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"stems/melody.wav","target_key":"F#","output_path":"matched/melody_fsharp.wav"}'

target_key: root note, e.g. C, F#, Bb, D#. Mode suffix (major/minor/m) is ignored — only the root matters for pitch. Requires chord-detect and stretch engines.
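Since only the root matters, the shift reduces to a pitch-class difference. The sketch below assumes the shorter direction is preferred (never more than 6 semitones either way); that choice and the helper name semitone_shift are assumptions, so check the semitones value in the response for the server's actual decision.

```python
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}


def semitone_shift(source_root: str, target_root: str) -> int:
    """Smallest signed semitone shift from one root note to another."""
    diff = (PITCH_CLASS[target_root] - PITCH_CLASS[source_root]) % 12
    return diff - 12 if diff > 6 else diff  # prefer shifting down over +7..+11


print(semitone_shift("A", "C"))    # up a minor third
print(semitone_shift("C", "A"))    # down a minor third
print(semitone_shift("C", "F#"))   # tritone, either direction is equally far
```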

Sidechain duck

Duck a primary track (music) whenever a trigger track (voice) is loud — the classic voiceover-over-music effect. Pure ffmpeg sidechaincompress, no model required.

# stage music + voice first via PUT /v1/files/...

curl -X POST http://localhost:8000/v1/audio/sidechain-duck \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/music.wav","trigger_file_path":"uploads/voice.wav","threshold_db":-20,"ratio":4,"attack_ms":10,"release_ms":200,"output_path":"out/ducked.wav"}'

# Aggressive duck for podcast-style music bed
curl -X POST http://localhost:8000/v1/audio/sidechain-duck \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"music/bed.wav","trigger_file_path":"voice/narration.wav","threshold_db":-30,"ratio":10,"release_ms":400,"output_path":"final/mix.wav"}'

The primary track is compressed whenever the trigger exceeds threshold_db. ratio sets compression intensity. For best results the two files should be the same duration; a shorter trigger is padded with silence.

Effects chain

Apply an ordered chain of pedalboard effects — full catalog, you pick the order and params. Different from /v1/audio/master (which runs preset mastering chains).

# Compress, then add reverb, then drop -3 dB
curl -X POST http://localhost:8000/v1/audio/fx \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","effects":[
    {"type":"Compressor","params":{"threshold_db":-18,"ratio":4.0}},
    {"type":"Reverb","params":{"room_size":0.5,"wet_level":0.3}},
    {"type":"Gain","params":{"gain_db":-3.0}}
  ],"output_path":"out/out.wav"}'

Allowed effects: Compressor, Limiter, NoiseGate, Gain, Clipping, Distortion, Bitcrush, Reverb, Chorus, Delay, Phaser, PitchShift, HighShelfFilter, LowShelfFilter, PeakFilter, HighpassFilter, LowpassFilter, LadderFilter, IIRFilter, GSMFullRateCompressor, MP3Compressor, Resample, Invert, Convolution.

VST3 / AudioUnit / external plugins are NOT in the allowlist — they load arbitrary native code.

Loop point

Find the best seamless loop boundary in an audio file — audiolla analyses the beat grid and returns the start and end positions where a loop will repeat without a click or gap.

# Find best loop boundary (default: minimum 4 bars)
curl -X POST http://localhost:8000/v1/audio/loop-point \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/beat.wav"}' | jq '{loop_start_sec, loop_end_sec, bars, score, tempo_bpm}'
# → {"loop_start_sec": 0.0, "loop_end_sec": 7.44, "bars": 4,
#    "score": 0.94, "tempo_bpm": 128.0}

# Require at least 8 bars, return top 3 candidates
curl -X POST http://localhost:8000/v1/audio/loop-point \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/long_track.wav","min_loop_bars":8,"num_candidates":3}'

Response fields: loop_start_sec, loop_end_sec, bars, score (0–1, higher = tighter loop), tempo_bpm, candidates (array of ranked alternatives). Optional params: min_loop_bars (default 4), num_candidates (default 5). Requires librosa-analyze engine.

Compose MIDI

POST a JSON song spec, get Standard MIDI File bytes back. Write the spec by hand, generate it from a tracker / DAW / sequencer, script it out of a Python notebook, or have an LLM produce it — audiolla doesn't care. No AI runs server-side; the spec is the music.

# C major arpeggio (four eighth notes) over a 4-beat kick pattern at 120 BPM, piano + kick drum
curl -X POST http://localhost:8000/v1/midi/compose \
  -H 'Content-Type: application/json' \
  -d '{
    "tempo_bpm": 120,
    "tracks": [
      {"name":"Lead","program":0,"channel":0,"notes":[
        {"pitch":60,"start_beats":0.0,"duration_beats":0.5,"velocity":100},
        {"pitch":64,"start_beats":0.5,"duration_beats":0.5,"velocity":100},
        {"pitch":67,"start_beats":1.0,"duration_beats":0.5,"velocity":100},
        {"pitch":72,"start_beats":1.5,"duration_beats":0.5,"velocity":100}
      ]},
      {"name":"Kick","program":0,"channel":9,"notes":[
        {"pitch":36,"start_beats":0.0,"duration_beats":0.1,"velocity":110},
        {"pitch":36,"start_beats":1.0,"duration_beats":0.1,"velocity":110},
        {"pitch":36,"start_beats":2.0,"duration_beats":0.1,"velocity":110},
        {"pitch":36,"start_beats":3.0,"duration_beats":0.1,"velocity":110}
      ]}
    ],
    "output_path": "midi/song.mid"
  }'
curl -o song.mid http://localhost:8000/v1/files/midi/song.mid

# Use a JSON spec file (must include output_path / output_url in the body)
curl -X POST http://localhost:8000/v1/midi/compose \
  -H 'Content-Type: application/json' \
  -d @spec.json

Spec fields: tempo_bpm (default 120), time_signature (default [4,4]), key_signature (optional, e.g. "C", "Am"), ticks_per_beat (default 480), tracks[].{name, program, channel, volume, pan, notes[].{pitch, start_beats, duration_beats, velocity}}. Time is in beats. program is GM program 0-127. Channel 9 (zero-indexed; channel 10 in 1-based numbering) is the GM drum channel, and pitches there map to the drum kit (36 = kick, 38 = snare, 42 = closed hi-hat, etc.).
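Because the spec is written in beats, it can help to see how those values relate to MIDI ticks and wall-clock time. This sketch assumes one beat is a quarter note and a constant tempo; both helper names are illustrative.

```python
def beats_to_ticks(beats: float, ticks_per_beat: int = 480) -> int:
    """Position in MIDI ticks for a position given in beats."""
    return round(beats * ticks_per_beat)


def beats_to_seconds(beats: float, tempo_bpm: float = 120.0) -> float:
    """Wall-clock time for a position in beats at a constant tempo."""
    return beats * 60.0 / tempo_bpm


# The last arpeggio note in the example above starts at beat 1.5.
print(beats_to_ticks(1.5))           # ticks at the default resolution
print(beats_to_seconds(1.5, 120.0))  # seconds at 120 BPM
```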

Inspect MIDI

# read the structure of any Standard MIDI File
curl -X POST http://localhost:8000/v1/midi/inspect \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid"}'
# → {type, ticks_per_beat, tempo_changes, time_signatures,
#    tracks[{name, note_on_count, channels, programs, length_beats}], ...}

Transform MIDI

# transpose all non-drum tracks up an octave
curl -X POST http://localhost:8000/v1/midi/transform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","transpose_semitones":12,"output_path":"midi/transposed.mid"}'

# override tempo to 140 BPM and save to staging
curl -X POST http://localhost:8000/v1/midi/transform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","tempo_bpm":140,"output_path":"midi/fast.mid"}'

# drop the drum track (channel 9)
curl -X POST http://localhost:8000/v1/midi/transform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","drop_channels":[9],"output_path":"midi/no-drums.mid"}'

# keep only channels 0 and 1
curl -X POST http://localhost:8000/v1/midi/transform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","keep_channels":[0,1],"output_path":"midi/two-ch.mid"}'

# quantize to 1/16th notes
curl -X POST http://localhost:8000/v1/midi/transform \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","quantize_grid_beats":0.25,"output_path":"midi/quantized.mid"}'

transpose_semitones ±48. quantize_grid_beats is in beats (0.25 = 1/16th at 4/4). keep_channels and drop_channels take a JSON array of channel numbers; only one can be set per request.

Render MIDI to audio

# Synthesise via the bundled FluidR3_GM SoundFont
curl -X POST http://localhost:8000/v1/midi/render \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","output_format":"wav","output_path":"out/song.wav"}'
curl -o song.wav http://localhost:8000/v1/files/out/song.wav

# Use your own SoundFont (must be staged first)
curl -X PUT --data-binary @my.sf2 \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/sf/orchestral.sf2
curl -X POST http://localhost:8000/v1/midi/render \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/song.mid","soundfont_path":"sf/orchestral.sf2","output_format":"flac","output_path":"out/orch.flac"}'

Generate music from a spec

Compose + render in one call — spec in, audio file staged.

# spec.json must include "output_path" or "output_url" alongside the composition fields
curl -X POST http://localhost:8000/v1/midi/generate \
  -H 'Content-Type: application/json' \
  -d @spec.json
curl -o song.wav http://localhost:8000/v1/files/out/song.wav

Drum pattern

Step-sequencer spec → GM drum MIDI. Define a rhythmic pattern as arrays of 0/1 step values for each drum voice; the server maps them to GM channel 9 pitches and bakes a MIDI file. Optional swing shifts even-numbered 16th steps for a shuffled feel.

# 4-on-the-floor kick, snare on 2&4, busy hi-hat — 2 bars at 120 BPM
curl -X POST http://localhost:8000/v1/midi/drum \
  -H "Content-Type: application/json" \
  -d '{
    "tempo_bpm": 120,
    "steps": 16,
    "bars": 2,
    "swing": 0.0,
    "pattern": {
      "kick":  [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0],
      "snare": [0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0],
      "hihat": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
    },
    "output_path": "midi/beat.mid"
  }'
curl -o beat.mid http://localhost:8000/v1/files/midi/beat.mid

# Swing groove — 0.1 = subtle, 0.5 = strong shuffle
curl -X POST http://localhost:8000/v1/midi/drum \
  -H "Content-Type: application/json" \
  -d '{
    "tempo_bpm": 95,
    "steps": 16,
    "bars": 1,
    "swing": 0.2,
    "pattern": {
      "kick":  [1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0],
      "snare": [0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0],
      "hihat": [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]
    },
    "output_path": "midi/groove.mid"
  }'

Body fields: tempo_bpm (default 120), steps (steps per bar, default 16), bars (default 1), swing (0.0–0.5, default 0.0), pattern (object — keys are drum voice names, values are arrays of 0/1). Supported voices: kick, snare, hihat, open_hihat, ride, crash, clap, tom_hi, tom_mid, tom_low, rim, cowbell. Requires midi-compose engine.
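A rough sketch of how a step pattern might expand into note start times. Everything here is an assumption for illustration: 4 beats per bar, swing expressed as a fraction of one step, the delay applied to every second step (the 2nd, 4th, ... counting from 1), and the helper name pattern_to_notes. The three pitches are the GM values quoted earlier in this document.

```python
GM_DRUMS = {"kick": 36, "snare": 38, "hihat": 42}  # subset, for illustration


def pattern_to_notes(pattern, steps=16, bars=1, swing=0.0, beats_per_bar=4):
    """Expand 0/1 step arrays into a sorted list of {pitch, start_beats}."""
    step_beats = beats_per_bar / steps
    notes = []
    for bar in range(bars):
        for voice, hits in pattern.items():
            for i, hit in enumerate(hits):
                if not hit:
                    continue
                start = bar * beats_per_bar + i * step_beats
                if i % 2 == 1:                 # delay off-beat steps for swing
                    start += swing * step_beats
                notes.append({"pitch": GM_DRUMS[voice], "start_beats": start})
    return sorted(notes, key=lambda n: (n["start_beats"], n["pitch"]))


groove = {"kick": [1, 0, 0, 0] * 4, "hihat": [1, 1] * 8}
for note in pattern_to_notes(groove, swing=0.2)[:4]:
    print(note)
```

The resulting list has the same shape as the notes array accepted by /v1/midi/compose (minus duration_beats and velocity), which makes it easy to compare a hand-rolled pattern with the server's output.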

Chords to MIDI

Detect the chord progression from an audio file and convert each segment to a MIDI chord (root + 3rd + 5th). Useful for exporting a detected chord chart as playable MIDI, re-harmonising an arrangement, or seeding a DAW session.

# Audio → chord MIDI at the detected tempo
curl -X POST http://localhost:8000/v1/audio/chords-to-midi \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav","output_path":"out/chords.mid"}'

# Override tempo, set velocity and octave
curl -X POST http://localhost:8000/v1/audio/chords-to-midi \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/song.wav","tempo_bpm":120,"velocity":90,"octave":3,"output_path":"out/chords.mid"}'

# Stage the output under a different path
curl -X POST http://localhost:8000/v1/audio/chords-to-midi \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"sessions/song.wav","output_path":"midi/song_chords.mid"}'

Optional params: tempo_bpm (default: detected from audio), velocity (1–127, default 80), octave (0–8, default 4), output_path. Requires chord-detect engine. Each chord segment becomes a MIDI chord event (root + major 3rd/minor 3rd + perfect 5th, duration = segment length).
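The triad construction described above can be sketched as below. It assumes the convention where octave 4 puts C at MIDI note 60; audiolla's own octave numbering is not confirmed here, and chord_pitches is a hypothetical helper.

```python
ROOTS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}


def chord_pitches(root: str, minor: bool = False, octave: int = 4) -> list[int]:
    """MIDI pitches for a root-position triad: root, 3rd, perfect 5th."""
    base = 12 * (octave + 1) + ROOTS[root]   # assumes C4 = 60
    third = 3 if minor else 4                # minor 3rd vs. major 3rd
    return [base, base + third, base + 7]


print(chord_pitches("C"))                        # C major, octave 4
print(chord_pitches("A", minor=True, octave=3))  # A minor, octave 3
```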

Audio metadata tags

Read and write ID3 (MP3), Vorbis (OGG/FLAC), INFO (WAV), and MP4 (M4A) tags via mutagen. Requires the metadata engine.

# Read tags
curl -X POST http://localhost:8000/v1/audio/metadata \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.mp3"}' | jq '{title, artist, bpm, key, duration_sec}'

# Write tags — returns updated tag set
curl -X POST http://localhost:8000/v1/audio/metadata \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.mp3","tags":{"title":"My Track","artist":"DJ Audiolla","bpm":"128","year":"2026"}}'

Clip detection

Detect digital clipping. No engine required — pure numpy arithmetic.

curl -X POST http://localhost:8000/v1/audio/clip-detect \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/loud_master.wav"}' | jq '{clipped, clip_count, clip_ratio, peak_db}'
# → {"clipped":true,"clip_count":4219,"clip_ratio":0.0048,"peak_db":0.0}
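
The response fields can be understood through a small sketch. The Python below uses plain lists rather than numpy and is only an approximation of the idea: the 0.999 threshold and the `clip_stats` name are assumptions, not the server's actual settings.

```python
import math


def clip_stats(samples, threshold=0.999):
    """Count samples whose absolute value reaches `threshold`.

    `samples` are floats normalised to [-1.0, 1.0]. The threshold is an
    assumption chosen for illustration.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    clip_count = sum(1 for s in samples if abs(s) >= threshold)
    return {
        "clipped": clip_count > 0,
        "clip_count": clip_count,
        "clip_ratio": clip_count / len(samples) if samples else 0.0,
        # Peak level in dB relative to full scale; silence maps to -inf.
        "peak_db": 20 * math.log10(peak) if peak > 0 else float("-inf"),
    }


stats = clip_stats([0.5, 1.0, -1.0, 0.25])
print(stats["clip_count"], stats["clip_ratio"], stats["peak_db"])  # 2 0.5 0.0
```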

Mid/Side encode and decode

Encode L/R stereo to Mid+Side or decode back. Useful for stereo width surgery without touching the pedalboard chain.

# Encode L/R → M/S
curl -X POST http://localhost:8000/v1/audio/mid-side \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/stereo.wav","mode":"encode","output_path":"out/ms_encoded.wav"}'

# Decode back to L/R
curl -X POST http://localhost:8000/v1/audio/mid-side \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"out/ms_encoded.wav","mode":"decode","output_path":"out/restored.wav"}'
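
For readers unfamiliar with M/S, the arithmetic is a sum and a difference. The Python sketch below shows one common convention (halving on encode so that decode is a plain sum and difference); whether the server scales by 0.5, by 1/√2, or not at all is not stated here, so treat the scaling as an assumption.

```python
def ms_encode(left, right):
    """L/R -> Mid/Side. Mid carries what the channels share, Side what differs."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Mid/Side -> L/R, the exact inverse of ms_encode."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left, right = [0.5, -0.25, 1.0], [0.25, 0.75, -1.0]
mid, side = ms_encode(left, right)
print(ms_decode(mid, side) == (left, right))  # True
```

Scaling `side` before decoding is what widens or narrows the stereo image.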

Beat slice

Detect beat positions with librosa and return a ZIP of numbered WAV/MP3 slices — one file per beat interval.

curl -X POST http://localhost:8000/v1/audio/beat-slice \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/loop.wav","output_format":"wav","output_path":"out/slices.zip"}'
curl -o slices.zip http://localhost:8000/v1/files/out/slices.zip
# → slices.zip: beat_001.wav, beat_002.wav, beat_003.wav …

# Stage the ZIP at a different path
curl -X POST http://localhost:8000/v1/audio/beat-slice \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/loop.wav","output_path":"beats/loop_slices.zip"}'
# → {"path":"beats/loop_slices.zip","beat_count":32,...}

Convolution reverb

Apply an impulse response (IR) to audio via pedalboard's Convolution. Any WAV file can be used as the IR.

# Upload your IR first
curl -X PUT --data-binary @plate_reverb.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/ir/plate.wav

# Apply — wet_mix: 0.0=dry only, 1.0=wet only
curl -X POST http://localhost:8000/v1/audio/conv-reverb \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/dry_vocal.wav","ir_file_path":"ir/plate.wav","wet_mix":0.25,"output_format":"wav","output_path":"out/reverbed.wav"}'

Transient shaper

Attack/sustain dual-compressor blending. Positive attack_gain_db makes drums punchier; negative sustain_gain_db cuts room tail.

# Punchy drums: boost attack, cut sustain
curl -X POST http://localhost:8000/v1/audio/transient \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/drums.wav","attack_gain_db":6,"sustain_gain_db":-4,"output_path":"out/punchy_drums.wav"}'

# Soft attack (pad-like)
curl -X POST http://localhost:8000/v1/audio/transient \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/synth.wav","attack_gain_db":-6,"sustain_gain_db":0,"output_path":"out/softened.wav"}'

Multiband compression

Split the signal into N+1 frequency bands and compress each one independently. Bands are split with zero-phase LR4-equivalent crossovers, so a bypassed chain reconstructs the original. Mastering-engineer staple — tame bass thump without squashing vocal sibilance, level out a busy mid-range, etc.

# 3-band mastering pass: low/mid/high
curl -X POST http://localhost:8000/v1/audio/multiband-compress \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/mixdown.wav","crossovers_hz":[200,3000],"bands":[
    {"threshold_db":-18,"ratio":4,"attack_ms":15,"release_ms":150,"makeup_db":1.5},
    {"threshold_db":-14,"ratio":3,"attack_ms":8, "release_ms":80, "makeup_db":1.0},
    {"threshold_db":-10,"ratio":2,"attack_ms":3, "release_ms":40, "makeup_db":0.5}
  ],"output_path":"out/mastered.wav"}'

crossovers_hz length is N, bands length is N+1. Each band: required threshold_db + ratio, optional attack_ms (default 10), release_ms (default 100), makeup_db (default 0).
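
The N versus N+1 relationship is easy to get wrong when bodies are generated by script. A minimal Python sketch, with hypothetical helper names, that derives the band edges from the crossovers and fills in the documented defaults:

```python
# Defaults as listed above; threshold_db and ratio have no default.
DEFAULTS = {"attack_ms": 10, "release_ms": 100, "makeup_db": 0}


def band_ranges(crossovers_hz):
    """Turn N crossover frequencies into N+1 (low_hz, high_hz) band edges."""
    edges = [0] + list(crossovers_hz) + [None]   # None = up to Nyquist
    return list(zip(edges[:-1], edges[1:]))


def complete_bands(crossovers_hz, bands):
    """Check the N / N+1 rule and apply defaults to each band."""
    if len(bands) != len(crossovers_hz) + 1:
        raise ValueError(f"need {len(crossovers_hz) + 1} bands, got {len(bands)}")
    for band in bands:
        if "threshold_db" not in band or "ratio" not in band:
            raise ValueError("each band needs threshold_db and ratio")
    return [{**DEFAULTS, **band} for band in bands]


print(band_ranges([200, 3000]))  # [(0, 200), (200, 3000), (3000, None)]
```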

DJ prep

One call returns a track's BPM, key, Camelot wheel position, and integrated LUFS. Requires librosa-analyze + chord-detect. LUFS is reported when a loudness engine is available.

curl -X POST http://localhost:8000/v1/audio/dj-prep \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/track.wav"}' | jq .
# → {"bpm":128.0,"key":"A minor","camelot":"8A","integrated_lufs":-9.4}

Camelot wheel positions let you quickly find harmonically compatible tracks for mixing.
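
The sketch below shows how a Camelot code relates to a key and which neighbouring codes are usually considered compatible (same code, one step either way around the wheel, or the relative major/minor). It is a client-side illustration in Python, not the server's implementation; the key spelling ("A minor", flats for most black keys) is an assumption based on the example response.

```python
# Index 0 is position 1 on the wheel.
CAMELOT_MAJOR = ["B", "F#", "Db", "Ab", "Eb", "Bb", "F", "C", "G", "D", "A", "E"]
CAMELOT_MINOR = ["Ab", "Eb", "Bb", "F", "C", "G", "D", "A", "E", "B", "F#", "Db"]


def camelot(key):
    """Map a key such as 'A minor' to its Camelot code such as '8A'."""
    tonic, mode = key.split()
    minor = mode == "minor"
    table = CAMELOT_MINOR if minor else CAMELOT_MAJOR
    return f"{table.index(tonic) + 1}{'A' if minor else 'B'}"


def compatible(code):
    """Codes that conventionally mix well with `code`."""
    number, letter = int(code[:-1]), code[-1]
    up = number % 12 + 1            # one step clockwise, wrapping 12 -> 1
    down = (number - 2) % 12 + 1    # one step counter-clockwise, wrapping 1 -> 12
    other = "B" if letter == "A" else "A"
    return [code, f"{down}{letter}", f"{up}{letter}", f"{number}{other}"]


print(camelot("A minor"))   # 8A
print(compatible("8A"))     # ['8A', '7A', '9A', '8B']
```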

De-ess

Split-band high-frequency de-esser — attenuates sibilance above frequency_hz without affecting the rest of the signal. Implemented with a Butterworth HPF, envelope follower, and per-channel gain reduction. No engine required.

# Default settings (threshold -20 dB, 6 kHz, 4:1 ratio)
curl -X POST http://localhost:8000/v1/audio/deess \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","output_path":"out/deessed.wav"}'

# Gentle pass on a mix
curl -X POST http://localhost:8000/v1/audio/deess \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/mix.wav","threshold_db":-15,"frequency_hz":7000,"ratio":2.5,"output_path":"out/mix_deessed.wav"}'

# Stage output under a different path
curl -X POST http://localhost:8000/v1/audio/deess \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/vocal.wav","output_path":"sessions/vocal_deessed.wav"}'
# → {"path":"sessions/vocal_deessed.wav","threshold_db":-20.0,"frequency_hz":6000.0,"ratio":4.0,...}

Optional params: threshold_db (≤ 0, default -20), frequency_hz (2000–15000, default 6000), ratio (1.0–20.0, default 4.0), output_format (wav/mp3/flac…), output_path.
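
To get a feel for how threshold_db and ratio interact, the following Python sketch evaluates the textbook static compressor curve for a given sibilance level. The server's envelope follower and filter details are not described here, so this is only an approximation of the gain-reduction step, and `gain_reduction_db` is a hypothetical name.

```python
def gain_reduction_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Attenuation in dB applied to the high band at a given detected level.

    Standard downward-compressor curve: below the threshold nothing happens;
    above it, every `ratio` dB of input yields 1 dB of output.
    """
    over = level_db - threshold_db
    if over <= 0:
        return 0.0
    return over * (1.0 - 1.0 / ratio)


print(gain_reduction_db(-8.0))    # 9.0  (12 dB over threshold at 4:1)
print(gain_reduction_db(-30.0))   # 0.0  (below threshold, untouched)
```

Under this model, a higher ratio or a lower threshold both increase attenuation.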

Stereo field analysis

Measure stereo width, phase correlation, mid/side balance, and mono compatibility. No engine required — pure numpy.

curl -X POST http://localhost:8000/v1/audio/stereo-field \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/stereo_mix.wav"}' | jq .
# → {
#     "correlation": 0.72,       # Pearson L/R correlation [-1,1]
#     "width": 0.41,             # side_rms / mid_rms
#     "balance_db": -0.3,        # L vs R level difference
#     "mono_compatible": true,   # correlation >= 0.5
#     "mid_level_db": -12.1,
#     "side_level_db": -18.4,
#     "phase_issues": false,
#     "channels": 2,
#     "sample_rate": 44100,
#     "duration": 210.5
#   }

# Analyze a different staged file
curl -X POST http://localhost:8000/v1/audio/stereo-field \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"masters/track.wav"}' | jq '{correlation, width, mono_compatible}'

Mono files return correlation=1.0, width=0.0, mono_compatible=true. Use correlation < 0 as a red flag for phase-cancelled material that will collapse on mono playback.
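
The main fields can be reproduced by hand. The Python sketch below follows the definitions given in the response comments (Pearson correlation, side RMS over mid RMS, mono-compatible at correlation >= 0.5) using plain lists; it is an illustration under those definitions rather than the server's numpy code, and the handling of silent channels is an assumption.

```python
import math


def _rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def stereo_field(left, right):
    """Correlation, width and mono compatibility for two equal-length channels."""
    n = len(left)
    mean_l, mean_r = sum(left) / n, sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    # Assumption: a silent or constant channel is treated as fully correlated.
    correlation = cov / math.sqrt(var_l * var_r) if var_l and var_r else 1.0
    mid_rms = _rms([(l + r) / 2 for l, r in zip(left, right)])
    side_rms = _rms([(l - r) / 2 for l, r in zip(left, right)])
    width = side_rms / mid_rms if mid_rms else float("inf")
    return {"correlation": correlation, "width": width,
            "mono_compatible": correlation >= 0.5}


tone = [0.5, -0.5, 0.25, -0.25]
print(stereo_field(tone, tone))                  # dual mono: width 0.0
print(stereo_field(tone, [-s for s in tone]))    # inverted: correlation near -1
```

The second call models a polarity-flipped channel, the case that cancels when summed to mono.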

Audio thumbnail

Extract the most energetic segment of an audio file — the passage with the highest onset density in a given window. Useful for generating preview clips, podcast teasers, or DJ cue points. Requires librosa-analyze.

# Default 30-second thumbnail
curl -X POST http://localhost:8000/v1/audio/thumbnail \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/long_track.wav","output_path":"out/preview.wav"}'

# 10-second teaser
curl -X POST http://localhost:8000/v1/audio/thumbnail \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/podcast.wav","duration_sec":10,"output_format":"mp3","output_path":"out/teaser.mp3"}'

# Stage + get timestamps
curl -X POST http://localhost:8000/v1/audio/thumbnail \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/album_track.wav","duration_sec":20,"output_path":"previews/track_thumb.wav"}'
# → {"path":"previews/track_thumb.wav","start_sec":47.3,"end_sec":67.3,"duration_sec":20.0,...}

Optional params: duration_sec (1–300, default 30), output_format, output_path. When output_path is set, the response JSON includes start_sec and end_sec, so you know exactly where in the source the thumbnail was extracted.
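
The "highest onset density in a given window" idea can be sketched as a sliding-window count. The Python below is a simplified stand-in that assumes onset timestamps are already available (for example from /v1/audio/onsets); the hop size, tie-breaking, and the `densest_window` name are assumptions, and the server may weight onsets differently.

```python
def densest_window(onsets, track_sec, duration_sec=30.0, hop_sec=0.1):
    """Return (start_sec, end_sec) of the window containing the most onsets.

    Ties go to the earliest window. Tracks shorter than the window are
    returned whole.
    """
    if track_sec <= duration_sec:
        return 0.0, track_sec
    best_start, best_count = 0.0, -1
    # Step through candidate starts on an integer grid to avoid float drift.
    steps = int((track_sec - duration_sec) / hop_sec) + 1
    for i in range(steps):
        start = i * hop_sec
        count = sum(1 for t in onsets if start <= t < start + duration_sec)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_start + duration_sec


onsets = [1.0, 50.0, 50.5, 51.0, 52.0, 90.0]
print(densest_window(onsets, track_sec=100.0, duration_sec=10.0, hop_sec=1.0))
# (43.0, 53.0)
```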

MIDI humanize

Add subtle timing and velocity variations to a MIDI file to make it sound less mechanical. Jitter is uniformly distributed and, when a seed is provided, fully deterministic. Requires midi-compose.

# Gentle humanize with defaults (±10 ms timing, ±10% velocity)
curl -X POST http://localhost:8000/v1/midi/humanize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/rigid.mid","output_path":"midi/human.mid"}'

# Heavier feel with a fixed seed for reproducible results
curl -X POST http://localhost:8000/v1/midi/humanize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/drums.mid","timing_ms":20,"velocity_pct":15,"seed":42,"output_path":"midi/drums_human.mid"}'

# Stage output under a different path
curl -X POST http://localhost:8000/v1/midi/humanize \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"midi/pattern.mid","timing_ms":8,"output_path":"midi/pattern_human.mid"}'
# → {"path":"midi/pattern_human.mid","timing_ms":8.0,"velocity_pct":10.0,...}

Optional params: timing_ms (0–500, default 10), velocity_pct (0–50, default 10), seed (any int, optional), output_path. Non-MIDI input returns 400. Requires midi-compose.
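
What "uniform jitter, deterministic with a seed" means in practice can be shown in a few lines. This Python sketch works on simple (start_sec, velocity) pairs rather than a real MIDI file, and it assumes velocity_pct is applied relative to each note's own velocity; both are simplifications, not a description of the server's internals.

```python
import random


def humanize(notes, timing_ms=10.0, velocity_pct=10.0, seed=None):
    """Jitter (start_sec, velocity) pairs with uniformly distributed offsets."""
    rng = random.Random(seed)   # a private generator keeps runs reproducible
    result = []
    for start_sec, velocity in notes:
        shift = rng.uniform(-timing_ms, timing_ms) / 1000.0
        scale = 1.0 + rng.uniform(-velocity_pct, velocity_pct) / 100.0
        new_velocity = max(1, min(127, round(velocity * scale)))
        result.append((max(0.0, start_sec + shift), new_velocity))
    return result


grid = [(0.5, 100), (1.0, 100), (1.5, 100), (2.0, 100)]
# Same seed, same output; a different seed gives a different feel.
print(humanize(grid, seed=42) == humanize(grid, seed=42))  # True
```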

Batch operations

Run multiple operations on staged files in one HTTP call. Operations run sequentially; each gets an independent result entry even if earlier ops fail.

Supported ops: convert, normalize, trim, fade, reverse, speed, eq.

# Stage input
curl -X PUT --data-binary @track.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/work/track.wav

# Batch: trim, convert to MP3, reverse in one call
curl -X POST http://localhost:8000/v1/batch \
  -H "Content-Type: application/json" \
  -d '[
    {"op":"trim","file_path":"work/track.wav","output_path":"work/chorus.wav","start_sec":30,"end_sec":60},
    {"op":"convert","file_path":"work/track.wav","output_path":"work/track.mp3","output_format":"mp3"},
    {"op":"reverse","file_path":"work/track.wav","output_path":"work/reversed.wav"}
  ]' | jq '.results[].status'
# → "ok" "ok" "ok"
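
Because failures are reported per op rather than as a 4xx, scripted callers need to inspect each result. A small Python sketch of generating ops and picking out the ones that did not succeed; the helper names are hypothetical, the "error" status value in the usage line is invented for illustration, and it assumes results come back in submission order.

```python
def convert_ops(paths, output_format="mp3"):
    """One convert op per staged file, keeping the stem and swapping the extension."""
    ops = []
    for path in paths:
        stem = path.rsplit(".", 1)[0]
        ops.append({"op": "convert", "file_path": path,
                    "output_path": f"{stem}.{output_format}",
                    "output_format": output_format})
    return ops


def failed_ops(ops, results):
    """Pair each op with its result and keep those whose status is not "ok".

    Assumption: results are returned in the same order as the submitted ops.
    """
    return [(op, res) for op, res in zip(ops, results) if res.get("status") != "ok"]


ops = convert_ops(["work/a.wav", "work/b.wav"])
# Hypothetical response body, shaped like the jq output above.
results = [{"status": "ok"}, {"status": "error"}]
print(failed_ops(ops, results))
```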

Async jobs and webhooks

Every audio endpoint accepts async_job=true — the request returns immediately with a job ID and the work happens in the background. Poll for status or register a webhook.

# Pre-stage input (one-time)
curl -X PUT --data-binary @track.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/track.wav

# Submit async with staging path — result written to /v1/files/stems/...
curl -X POST http://localhost:8000/v1/audio/separate \
  -H 'Content-Type: application/json' \
  -d '{
    "file_path":"uploads/track.wav",
    "engine":"htdemucs",
    "stems":["vocals"],
    "async_job":true,
    "webhook_url":"https://my-server.com/hooks/audio",
    "output_path":"stems/track-vocals.wav"
  }'
# → {"job_id":"abc123","status":"pending","status_url":"/v1/jobs/abc123"}

# Submit async with presigned S3 PUT URL — result uploaded on completion
curl -X POST http://localhost:8000/v1/audio/master \
  -H 'Content-Type: application/json' \
  -d '{
    "file_path":"uploads/track.wav",
    "mode":"chain",
    "preset":"transparent",
    "async_job":true,
    "output_url":"https://bucket.s3.amazonaws.com/result.wav?X-Amz-..."
  }'
# → {"job_id":"def456","status":"pending","status_url":"/v1/jobs/def456"}

# Poll
curl http://localhost:8000/v1/jobs/abc123 | jq '{status, duration_sec, result}'

# List all jobs (optional ?status=pending|running|completed|failed|cancelled)
curl http://localhost:8000/v1/jobs

# Cancel a running job
curl -X DELETE http://localhost:8000/v1/jobs/abc123

Webhook payload (POST to your URL when the job completes):

{
  "id": "abc123",
  "endpoint": "/v1/audio/separate",
  "status": "completed",
  "duration_sec": 12.4,
  "result": {"path": "stems/track-vocals.wav", "size": 3145728, ...}
}

Webhook delivery is attempted up to 4 times with exponential backoff (delays of 0 s, 1 s, 2 s, 4 s). Completed jobs stay in memory for AUDIOLLA_JOB_TTL seconds (default 1 hour) and are then swept.
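
When a webhook receiver is not an option, polling is the fallback. The Python sketch below waits for a job to reach one of the terminal states listed above. The function names, the 2-second interval, and the timeout are assumptions; `fetch` is injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request

TERMINAL = {"completed", "failed", "cancelled"}


def http_fetch(job_id, base="http://localhost:8000"):
    """GET /v1/jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base}/v1/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(fetch, job_id, interval_sec=2.0, timeout_sec=600.0, sleep=time.sleep):
    """Poll until the job is completed, failed or cancelled, then return it."""
    deadline = time.monotonic() + timeout_sec
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_sec} s")
        sleep(interval_sec)


# Against a live server: wait_for_job(http_fetch, "abc123")
```

Keep the poll interval well below AUDIOLLA_JOB_TTL so a finished job is read before it is swept.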

Stage files

A simple server-side file store under /v1/files. Upload, list, download, delete.

# upload
curl -X PUT --data-binary @track.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/mytrack.wav

# list
curl http://localhost:8000/v1/files

# download
curl http://localhost:8000/v1/files/mytrack.wav -o copy.wav

# delete
curl -X DELETE http://localhost:8000/v1/files/mytrack.wav

Once staged, reference the file by path on any audio endpoint via file_path:

# Analyze a staged file
curl -X POST http://localhost:8000/v1/audio/analyze \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"mytrack.wav","features":["bpm"]}'

# Separate stems and write the result back to staging
curl -X POST http://localhost:8000/v1/audio/separate \
  -H 'Content-Type: application/json' \
  -d '{
    "file_path":"mytrack.wav",
    "engine":"htdemucs",
    "stems":["vocals"],
    "output_path":"stems/mytrack-vocals.wav"
  }'
# → {"path":"stems/mytrack-vocals.wav","size":...,"output_format":"wav",...}

Remote URLs

Disabled by default. To allow the server to fetch file_url or PUT to output_url, set the policy at container start:

docker run ... \
  -e AUDIOLLA_FETCH_MODE=allowlist \
  -e AUDIOLLA_FETCH_HOSTS="*.s3.amazonaws.com,*.r2.cloudflarestorage.com" \
  psyb0t/audiolla:latest

Then:

# Fetch from S3, master, PUT result back to a presigned S3 URL
curl -X POST http://localhost:8000/v1/audio/master \
  -H 'Content-Type: application/json' \
  -d '{
    "file_url":"https://my-bucket.s3.amazonaws.com/in.wav",
    "reference_url":"https://my-bucket.s3.amazonaws.com/ref.wav",
    "mode":"reference",
    "output_url":"https://my-bucket.s3.amazonaws.com/out.wav?X-Amz-Signature=..."
  }'
# → {"url":"...","size":...,"output_format":"wav",...}

Policy modes:

disabled (default) — file_url / output_url rejected with 400
allowlist — only hosts matching AUDIOLLA_FETCH_HOSTS allowed
denylist — anything except listed hosts allowed (pair with AUDIOLLA_FETCH_ALLOW_PRIVATE=false to block private IPs / metadata services)

Always-on protections:

DNS-resolved private / loopback / link-local IPs rejected (toggleable via AUDIOLLA_FETCH_ALLOW_PRIVATE)
Only https by default; http opt-in via AUDIOLLA_FETCH_SCHEMES
Redirects re-validated through the same policy
Hard timeout, plus a size cap equal to AUDIOLLA_MAX_UPLOAD_BYTES
Every fetch / upload URL logged

See Configuration for all AUDIOLLA_FETCH_* env vars.
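
To predict whether a URL will pass the policy before sending a request, a client-side check along these lines can help. This Python sketch covers only the mode, scheme, and host-pattern rules; it assumes the host patterns use shell-style globbing, and it does not reproduce the DNS, redirect, timeout, or size checks the server performs. `url_allowed` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def url_allowed(url, mode="disabled", hosts=(), schemes=("https",)):
    """Approximate the host/scheme part of the fetch policy."""
    if mode == "disabled":
        return False
    if mode not in ("allowlist", "denylist"):
        raise ValueError(f"unknown mode: {mode}")
    parts = urlsplit(url)
    host = parts.hostname          # already lower-cased by urlsplit
    if parts.scheme not in schemes or not host:
        return False
    # Assumption: patterns such as *.s3.amazonaws.com are shell-style globs.
    listed = any(fnmatchcase(host, pattern.lower()) for pattern in hosts)
    return listed if mode == "allowlist" else not listed


hosts = ["*.s3.amazonaws.com", "*.r2.cloudflarestorage.com"]
print(url_allowed("https://my-bucket.s3.amazonaws.com/in.wav", "allowlist", hosts))  # True
print(url_allowed("http://my-bucket.s3.amazonaws.com/in.wav", "allowlist", hosts))   # False
```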

Engines

Slug	What it does
`htdemucs`	4-stem separation: drums, bass, other, vocals. Best speed/quality tradeoff.
`htdemucs_ft`	Same 4 stems, fine-tuned weights. Higher quality, ~4x slower. CUDA-only — rejected with 400 on the CPU image.
`htdemucs_6s`	6 stems — also splits guitar and piano. Experimental.
`mdx_extra`	Strong on vocal isolation. MUSDB-trained, different architecture.
`matchering`	Reference-based mastering: EQ + loudness matched to a reference track.
`pedalboard-chain`	Preset mastering chains via pedalboard — `transparent` (light) or `loud` (4:1 squash). Backs `/v1/audio/master` with `mode=chain`. For arbitrary chains use `fx-chain` / `/v1/audio/fx`.
`librosa-analyze`	BPM, key, LUFS, duration, spectral features, beat grid, onset detection, melody (pyin), structural segmentation via librosa.
`sox-transform`	Gain, EQ, compression, reverb, pitch shift, tempo via pysox.
`fx-chain`	Arbitrary pedalboard effects chain — full catalog, your order and params. Backs `/v1/audio/fx`.
`midi-compose`	JSON spec → MIDI bytes. Also inspects and transforms existing MIDI files. Backs `/v1/midi/{compose,inspect,transform,generate}`.
`midi-render`	MIDI → audio via fluidsynth + SoundFont. Backs `/v1/midi/render` and `/v1/midi/generate`.
`silence-detect`	Locate silent gaps via ffmpeg `silencedetect`. Optional auto-trim. Backs `/v1/audio/silence`.
`ffmpeg-render`	Static PNG spectrogram/waveform + 8-mode animated MP4/WebM video via ffmpeg filters. Backs `/v1/audio/visualize/image/*` and `/v1/audio/visualize/video/{mode}`.
`audio-fingerprint`	Chromaprint acoustic fingerprint via `fpcalc`. Backs `/v1/audio/fingerprint`.
`uvr-dereverb`	BS-Roformer de-reverb — removes room reverb; `primary_stem=No Reverb`.
`uvr-deecho`	VR Architecture de-echo — normal and aggressive modes; pass `aggressive=true` for harder suppression.
`uvr-denoise`	MelBand Roformer de-noise (SDR 28) — removes broadband background noise.
`uvr-karaoke`	MelBand Roformer karaoke — remove lead vocals, keep backing; works via `/v1/audio/separate`.
`uvr-vocal-bsr`	BS-Roformer vocal/instrumental (SDR 13) — highest-quality vocal separation; works via `/v1/audio/separate`.
`basic-pitch`	Polyphonic audio-to-MIDI via Spotify basic-pitch (ONNX backend). Backs `/v1/audio/to_midi`.
`deepfilter`	Neural speech and vocal enhancement via DeepFilterNet DF3. Backs `/v1/audio/enhance`.
`chord-detect`	Chord and key detection via librosa — Krumhansl-Schmuckler key estimation + chroma template chord segmentation. Backs `/v1/audio/chords`.
`silero-vad`	Voice activity detection via silero-vad (ONNX) — returns speech/non-speech segments with timestamps and speech ratio. Backs `/v1/audio/vad`.
`pyannote`	Speaker diarization via pyannote/speaker-diarization-3.1 — returns per-speaker timestamped segments. Requires `HUGGINGFACE_TOKEN`. Backs `/v1/audio/diarize`.
`stretch`	Time-stretch + pitch-shift via librosa phase vocoder — independent tempo factor and semitone offset. Backs `/v1/audio/stretch`.
`ast-tag`	Audio tagging via Audio Spectrogram Transformer (MIT/ast-finetuned-audioset-10-10-0.4593) — top-K AudioSet class labels. Requires HF model cache. Backs `/v1/audio/tag`.
`clap-embed`	512-dim L2-normalized audio embeddings via LAION CLAP (laion/larger_clap_music_and_speech) — semantic audio search. Requires HF model cache. Backs `/v1/audio/embed`.
`hpss`	Harmonic/percussive source separation via librosa HPSS median filter — returns harmonic + percussive stems as a ZIP. Backs `/v1/audio/separate/hpss`.
`noise-reduce`	Spectral noise reduction via noisereduce — stationary (constant hum/hiss) and non-stationary (adaptive) modes, no GPU required. Backs `/v1/audio/noise-reduce/noise-reduce`.
`metadata`	Read/write audio tags (ID3 for MP3, Vorbis for OGG/FLAC, INFO for WAV, MP4 for M4A) via mutagen. No ML weights. Backs `/v1/audio/metadata`.
`stable-audio-open`	Text-to-audio — Stability Stable Audio Open 1.0. Stability Community Licence (commercial use OK below the revenue threshold; read the license). 47-second hard cap; best for loops, riffs, ambient textures, SFX, drum beats. No vocals. ~12 GB VRAM at fp16 — CUDA-only. Backs `/v1/audio/generate/stable-audio-open`.
`musicgen-small`	Text-to-music — Meta MusicGen 300M. CC-BY-NC 4.0 (non-commercial only; opt-in via `AUDIOLLA_ENABLE_NONCOMMERCIAL=1` in the server env). 30 s hard cap; instrumental only. ~3 GB VRAM at fp16 — CUDA-only. Backs `/v1/audio/generate/musicgen-small`.
`musicgen-medium`	Text-to-music — Meta MusicGen 1.5B. CC-BY-NC 4.0 (same opt-in). 30 s hard cap; higher quality than -small. ~6-8 GB VRAM at fp16 — CUDA-only. Backs `/v1/audio/generate/musicgen-medium`.
`riffusion`	Text-to-music — Riffusion-v1, a Stable Diffusion fine-tune that generates spectrograms (converted to audio via Griffin-Lim). CreativeML OpenRAIL-M (commercial use OK with the licence's usage restrictions). ~5 s per pass, lo-fi character, 22.05 kHz mono. ~3 GB VRAM at fp16 — CUDA-only. Backs `/v1/audio/generate/riffusion`.
`audioldm2`	Text-to-audio / SFX — AudioLDM 2 (cvssp/audioldm2). CC-BY 4.0 (commercial use OK — no opt-in gate, the only commercial-safe generator in this set). General-purpose SFX: environmental ambience, animal sounds, foley, mechanical / impact sounds. 16 kHz mono, up to 30 s. Slow (200-step DDIM by default — pass `num_inference_steps=50` to trade quality for ~4x speed). ~8-10 GB VRAM at fp16 with CPU offload. CUDA-only. Backs `/v1/audio/generate/audioldm2`.

Each Demucs variant is its own checkpoint (hosted on dl.fbaipublicfiles.com). The entrypoint prefetches every enabled variant into /data/torch_cache/ at startup so the first separation request doesn't sit there downloading.

AUDIOLLA_ENABLED_ENGINES — restrict which engines are available. AUDIOLLA_PRELOAD — load specific engines into memory at startup instead of waiting for the first request.

Workflows — presets + pipeline

Two ways to chain operations server-side without re-uploading the audio between calls:

Curated presets — server-side YAML workflows shipped in presets/. Run one with a single POST:

# Pre-stage input
curl -X PUT --data-binary @mix.wav \
  -H 'Content-Type: application/octet-stream' \
  http://localhost:8000/v1/files/uploads/mix.wav

# Master a mix for Spotify (-14 LUFS) — multiband compress + normalise
curl -X POST http://localhost:8000/v1/presets/master-for-spotify \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"uploads/mix.wav","output_path":"out/mastered.wav"}'
# → {"path":"out/mastered.wav","size":...,"steps":[...]}
curl -o mastered.wav http://localhost:8000/v1/files/out/mastered.wav

# List available presets
curl http://localhost:8000/v1/presets | jq '.data[] | {name, description}'

# Inspect a preset's steps before running
curl http://localhost:8000/v1/presets/podcast-cleanup | jq '.steps'

Shipped presets: master-for-spotify (3-band master + -14 LUFS), podcast-cleanup (DeepFilterNet + de-ess + -16 LUFS), vocal-cleanup (UVR dereverb + denoise + de-ess + light comp). Add your own as a YAML file in presets/.

Ad-hoc pipeline — chain any registered ops in a single call:

# Restore + multiband + normalise in one request — intermediates stay
# server-side, no re-upload between steps.
curl -X POST http://localhost:8000/v1/pipeline \
  -H 'Content-Type: application/json' \
  -d '{
    "file_path":"uploads/track.wav",
    "output_path":"out/pipelined.wav",
    "steps":[
      {"op":"restore","params":{"engine":"uvr-denoise"}},
      {"op":"multiband_compress","params":{
        "crossovers_hz":[200,3000],
        "bands":[
          {"threshold_db":-18,"ratio":3},
          {"threshold_db":-14,"ratio":2.5},
          {"threshold_db":-10,"ratio":2}
        ]
      }},
      {"op":"normalize","params":{"target_lufs":-14}}
    ]
  }'
# → {"path":"out/pipelined.wav","size":...,"steps":[...]}

# Discover available ops
curl http://localhost:8000/v1/ops | jq .

Responses from the pipeline and preset endpoints include a steps log so you can audit what ran. Both endpoints support async_job=true, output_path, and output_url, like every other audio-producing endpoint.

API catalog

GET /v1/catalog returns the machine-readable list of every endpoint grouped by category (separation, restoration, dynamics, eq-spatial, mastering, time-pitch, editing, analysis, effects-creative, visualize, midi, metadata, workflow, speech, files, jobs, management). Use it for discovery; LLM agents and codegen scripts both consume it.

curl http://localhost:8000/v1/catalog | jq '.categories[] | {name, endpoint_count: (.endpoints | length)}'

Endpoints

Full wire contract: openapi.yaml.

Audio processing

Every endpoint takes a JSON body. Inputs take exactly one of file_path (pre-staged file under FILES_DIR) or file_url (HTTPS URL the server fetches). Audio-producing endpoints additionally require exactly one of output_path (server writes the result under FILES_DIR) or output_url (presigned PUT; the server uploads the encoded bytes). Supplying neither or both of a pair returns 400. Responses are always JSON: no raw audio bytes, no Content-Disposition: attachment, no *_base64 fields.
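
The exactly-one-of rule can be checked client-side before a request is sent. A minimal Python sketch, with a hypothetical `check_body` helper, that mirrors the contract described above:

```python
def check_body(body, produces_audio=True):
    """Return None if the body satisfies the contract, else an error message.

    Each pair must have exactly one member present; the server answers 400
    when neither or both are given.
    """
    if ("file_path" in body) == ("file_url" in body):
        return "exactly one of file_path / file_url is required"
    if produces_audio and ("output_path" in body) == ("output_url" in body):
        return "exactly one of output_path / output_url is required"
    return None


print(check_body({"file_path": "uploads/a.wav", "output_path": "out/a.wav"}))  # None
print(check_body({"file_path": "uploads/a.wav"}))
# exactly one of output_path / output_url is required
```

Pass `produces_audio=False` for analysis-only endpoints, which return JSON and need no output target.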

Method	Path	Default returns
`POST`	`/v1/audio/separate`	JSON `{path\|url, size, ...}` — one stem; multi-stem (or all) returns ZIP stream of stems via `output_path`/`output_url`
`POST`	`/v1/audio/master`	JSON `{path\|url, size, output_format, ...}`
`POST`	`/v1/audio/analyze`	JSON — BPM, key, LUFS, spectral features
`POST`	`/v1/audio/beats`	JSON — BPM + beat timestamps; optional click-track WAV
`POST`	`/v1/audio/onsets`	JSON — onset timestamps
`POST`	`/v1/audio/melody`	JSON — dominant melody contour; optional MIDI export
`POST`	`/v1/audio/segments`	JSON — structural segment labels (A, B, C…)
`POST`	`/v1/audio/silence`	JSON — silent/non-silent ranges; optional trimmed audio
`POST`	`/v1/audio/visualize/image/spectrogram`	JSON `{path\|url, size, ...}` — static PNG spectrogram (`color`, `scale` params)
`POST`	`/v1/audio/visualize/image/waveform`	JSON `{path\|url, size, ...}` — static PNG waveform (`color` param)
`POST`	`/v1/audio/visualize/video/{mode}`	JSON `{path\|url, size, ...}` — animated MP4/WebM video (8 modes: `spectrum`, `waves`, `cqt`, …)
`POST`	`/v1/audio/fingerprint`	JSON — Chromaprint fingerprint string
`POST`	`/v1/audio/restore/{engine}`	JSON `{path\|url, size, output_format, ...}` — reverb/echo/noise removed; `aggressive=true` for uvr-deecho hard mode
`POST`	`/v1/audio/to_midi/{engine}`	JSON `{path\|url, size, ...}` — polyphonic transcription (MIDI)
`POST`	`/v1/audio/enhance/{engine}`	JSON `{path\|url, size, output_format, ...}` — neural speech/vocal enhancement
`POST`	`/v1/audio/generate/{engine}`	JSON `{path\|url, size, output_format, ...}` — text-to-audio (engine = `stable-audio-open` / `musicgen-small` / `musicgen-medium` / `riffusion` / `audioldm2`); `prompt` required, optional `duration_sec` / `seed` / `lyrics` / `num_inference_steps`
`POST`	`/v1/audio/chords`	JSON — detected key and chord progression
`POST`	`/v1/audio/vad`	JSON — speech/non-speech segments with timestamps and speech ratio
`POST`	`/v1/audio/diarize/{engine}`	JSON — per-speaker timestamped segments
`POST`	`/v1/audio/transform`	JSON `{path\|url, size, output_format, ...}`
`POST`	`/v1/audio/loudness`	JSON — `{loudness_lufs}` (measure only, no audio)
`POST`	`/v1/audio/loudness/curve`	JSON — `{curve:[{time_sec,rms_db}],duration,sample_rate,points}`; `hop_length` param
`POST`	`/v1/audio/normalize`	JSON `{path\|url, size, measured_lufs, ...}` — requires `target_lufs`; pre-normalization LUFS reported in `measured_lufs` field
`POST`	`/v1/audio/separate/hpss`	JSON `{path\|url, size, ...}` — ZIP stream containing `harmonic.<fmt>` + `percussive.<fmt>`
`POST`	`/v1/audio/noise-reduce/{engine}`	JSON `{path\|url, size, output_format, ...}` — `engine=noise-reduce` (DSP, `stationary`/`prop_decrease`) or `uvr-denoise` (ML)
`POST`	`/v1/audio/stretch`	JSON `{path\|url, size, output_format, ...}`
`POST`	`/v1/audio/pitch-correct`	JSON `{path\|url, size, output_format, ...}` — `strength` [0.0–1.0]; requires `librosa-analyze`
`POST`	`/v1/audio/repair`	JSON `{path\|url, size, output_format, ...}` — `declip` bool, `dehum` bool, `hum_freq` Hz
`POST`	`/v1/audio/tag`	JSON — top-K AudioSet labels with confidence scores
`POST`	`/v1/audio/embed`	JSON — 512-dim embedding; with `query_text` also returns cosine similarity
`POST`	`/v1/audio/classify`	JSON — `{results: [{label, score}]}` sorted descending; requires `clap-embed`
`POST`	`/v1/audio/info`	JSON — duration, sample_rate, channels, codec, bit_depth, format
`POST`	`/v1/audio/trim`	JSON `{path\|url, size, output_format, ...}` — `start_sec` + `end_sec` required
`POST`	`/v1/audio/mix`	JSON `{path\|url, size, output_format, ...}` — `tracks` JSON array required (≥2 entries)
`POST`	`/v1/audio/concat`	JSON `{path\|url, size, output_format, ...}` — `files` JSON array required (≥2 entries)
`POST`	`/v1/audio/speed`	JSON `{path\|url, size, output_format, ...}` — `speed` float required (0.1–10.0)
`POST`	`/v1/audio/convert`	JSON `{path\|url, size, output_format, ...}` — format/sample_rate/channels conversion
`POST`	`/v1/audio/similar`	JSON — `{similarity, dim}`; requires `clap-embed`
`POST`	`/v1/audio/fade`	JSON `{path\|url, size, output_format, ...}` — `fade_in`/`fade_out` seconds, 13 `curve` options
`POST`	`/v1/audio/reverse`	JSON `{path\|url, size, output_format, ...}` — flips playback direction
`POST`	`/v1/audio/loop`	JSON `{path\|url, size, output_format, ...}` — `count` total plays (≥2)
`POST`	`/v1/audio/bpm-match`	JSON `{path\|url, size, output_format, ...}` — `target_bpm` required; requires `librosa-analyze` + `stretch`
`POST`	`/v1/audio/stereo-width`	JSON `{path\|url, size, output_format, ...}` — `width` [0.0–3.0]; M/S stereo processing
`POST`	`/v1/audio/split`	JSON `{path\|url, size, ...}` — ZIP stream; `mode=equal` (requires `count`) or `mode=silence`
`POST`	`/v1/audio/pan`	JSON `{path\|url, size, output_format, ...}` — `position` [-1.0–1.0]
`POST`	`/v1/audio/eq`	JSON `{path\|url, size, output_format, ...}` — `bands` JSON array of `{freq, gain_db, width_hz}`
`POST`	`/v1/audio/key-match`	JSON `{path\|url, size, output_format, ...}` — `target_key` required; requires `chord-detect` + `stretch`
`POST`	`/v1/audio/sidechain-duck`	JSON `{path\|url, size, output_format, ...}` — primary + `trigger_file_*`; ffmpeg sidechaincompress
`POST`	`/v1/audio/fx`	JSON `{path\|url, size, output_format, ...}`
`POST`	`/v1/audio/metadata`	JSON — tag fields (title, artist, bpm, key, duration, sample_rate…); writes tags when `tags` JSON is provided
`POST`	`/v1/audio/clip-detect`	JSON — clipped, clip_count, clip_ratio, peak_db, duration_sec
`POST`	`/v1/audio/mid-side`	JSON `{path\|url, size, output_format, ...}` — `mode=encode` (L/R→M/S) or `mode=decode` (M/S→L/R)
`POST`	`/v1/audio/beat-slice`	JSON `{path\|url, size, ...}` — ZIP stream of numbered beat slices; requires `librosa-analyze`
`POST`	`/v1/audio/conv-reverb`	JSON `{path\|url, size, output_format, ...}` — `ir_file_path` / `ir_file_url` required; `wet_mix` [0.0–1.0]
`POST`	`/v1/audio/transient`	JSON `{path\|url, size, output_format, ...}` — `attack_gain_db` + `sustain_gain_db`
`POST`	`/v1/audio/multiband-compress`	JSON `{path\|url, size, output_format, ...}` — N-band compressor; `crossovers_hz` + `bands` JSON arrays
`POST`	`/v1/audio/dj-prep`	JSON — bpm, key, camelot, integrated_lufs; requires `librosa-analyze` + `chord-detect`
`POST`	`/v1/audio/loop-point`	JSON — `{loop_start_sec,loop_end_sec,bars,score,tempo_bpm,candidates}`; requires `librosa-analyze`
`POST`	`/v1/audio/chords-to-midi`	JSON `{path\|url, size, ...}` — chord progression from audio (MIDI); requires `chord-detect`
`POST`	`/v1/audio/deess`	JSON `{path\|url, size, output_format, ...}` — split-band sibilance attenuation; `threshold_db`, `frequency_hz`, `ratio`
`POST`	`/v1/audio/stereo-field`	JSON — `{correlation, width, balance_db, mono_compatible, mid_level_db, side_level_db, phase_issues, …}`
`POST`	`/v1/audio/thumbnail`	JSON `{path\|url, size, start_sec, end_sec, ...}` — most energetic `duration_sec` segment; requires `librosa-analyze`

Workflow — presets, pipeline, catalog

Server-side multi-step chains + discovery. See Workflows for narrative + curl examples.

Method	Path	Description
`GET`	`/v1/catalog`	machine-readable endpoint list grouped by category (17 categories)
`GET`	`/v1/ops`	list of pipeline op slugs (~24) usable in presets + `/v1/pipeline`
`GET`	`/v1/presets`	list curated server-side workflows (name + description)
`GET`	`/v1/presets/{name}`	describe one preset including all steps
`POST`	`/v1/presets/{name}`	JSON `{path\|url, size, steps, ...}` — run a curated preset; response includes a `steps` audit log of each op executed
`POST`	`/v1/pipeline`	JSON `{path\|url, size, steps, ...}` — ad-hoc `steps=[{op, params}, …]` chain, server-side intermediates; response includes a `steps` audit log

Batch

Method	Path	Description
`POST`	`/v1/batch`	JSON body: array of op objects `{op, file_path, output_path, …}`. Returns `{results:[…]}` — errors per-op, not a 4xx. Supported ops: `convert`, `normalize`, `trim`, `fade`, `reverse`, `speed`, `eq`.

Async jobs

Every audio endpoint accepts "async_job": true in the JSON body. Optional "webhook_url" for push-style delivery. When async_job=true, the endpoint returns HTTP 202 with {job_id, status: "pending", status_url} instead of executing inline.

Method	Path	Description
`GET`	`/v1/jobs`	list jobs; optional `?status=pending\|running\|completed\|failed\|cancelled`
`GET`	`/v1/jobs/{job_id}`	poll one job — returns status, result, duration_sec
`DELETE`	`/v1/jobs/{job_id}`	cancel running job or remove completed job
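
A polling loop for the job endpoints above might look like this sketch. The status values come from the `?status=` filter; the base URL, the poll interval, and the timeout are assumptions, and the fetch function is injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request

# Statuses after which a job will not change again.
TERMINAL = {"completed", "failed", "cancelled"}


def http_get_job(job_id, base_url="http://localhost:8000"):
    """Fetch one job record from GET /v1/jobs/{job_id}."""
    with urllib.request.urlopen(f"{base_url}/v1/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=http_get_job, interval=1.0, timeout=600.0, sleep=time.sleep):
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)
```

Passing a `webhook_url` avoids the loop entirely; polling is the fallback when the caller cannot receive inbound HTTP.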

MIDI

Method	Path	Default returns
`POST`	`/v1/midi/compose`	JSON `{path\|url, size, ...}` — body is JSON song spec; writes MIDI
`POST`	`/v1/midi/inspect`	JSON — tempo, tracks, channels, note counts, time/key signatures
`POST`	`/v1/midi/transform`	JSON `{path\|url, size, ...}` — transpose, quantize, tempo override, channel filter; writes MIDI
`POST`	`/v1/midi/quantize`	JSON `{path\|url, size, ...}` — `grid_beats` snaps all note timings to a rhythmic grid; writes MIDI
`POST`	`/v1/midi/render`	JSON `{path\|url, size, output_format, ...}` — input MIDI via `file_path` / `file_url`; writes audio
`POST`	`/v1/midi/generate`	JSON `{path\|url, size, output_format, ...}` — body is JSON song spec (compose + render in one); writes audio
`POST`	`/v1/midi/drum`	JSON `{path\|url, size, ...}` — body is JSON step-sequencer spec; writes MIDI; requires `midi-compose`
`POST`	`/v1/midi/humanize`	JSON `{path\|url, size, ...}` — timing + velocity jitter; `timing_ms`, `velocity_pct`, `seed`; writes MIDI; requires `midi-compose`

File staging

Method	Path	Description
`GET`	`/v1/files`	list staged files
`PUT`	`/v1/files/{path}`	upload
`GET`	`/v1/files/{path}`	download
`DELETE`	`/v1/files/{path}`	delete
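
The staging routes map directly onto HTTP verbs, so a small helper can build all three single-file requests. This is a sketch: it assumes the upload body is the raw file bytes and that nested paths are allowed under `/v1/files/`, neither of which is stated above.

```python
import urllib.parse
import urllib.request


def staged_file_request(method, path, data=None, base_url="http://localhost:8000", token=None):
    """Build an unsent PUT / GET / DELETE request for one staged file."""
    if method not in {"PUT", "GET", "DELETE"}:
        raise ValueError(f"unsupported method: {method}")
    # quote() keeps "/" intact and percent-encodes spaces and other unsafe characters.
    url = f"{base_url}/v1/files/{urllib.parse.quote(path)}"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


upload = staged_file_request("PUT", "in/my song.wav", data=b"RIFF...")
download = staged_file_request("GET", "in/my song.wav")
# urllib.request.urlopen(upload) would send the bytes to the server.
```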

Management

Method	Path	Description
`GET`	`/healthz`	liveness — always unauthenticated
`GET`	`/v1/engines`	list configured engines + `loaded` / `idle_seconds` per engine
`GET`	`/v1/ps`	list engines in memory right now
`DELETE`	`/v1/ps/{engine}`	evict one engine
`POST`	`/v1/unload`	evict everything

MCP

audiolla exposes a Model Context Protocol server at /v1/mcp. Point any MCP-capable LLM agent at it and it gets the full audio processing surface as callable tools — separate stems, detect chords, transcribe to MIDI, diarize speakers, compose music from a JSON spec, read/write tags, submit async jobs — all over JSON-RPC without writing a line of integration code.

Audio-producing MCP tools follow the same contract as REST: callers MUST pass exactly one of output_path or output_url. With output_path, the server writes the result under FILES_DIR, the client retrieves it via the get_file tool or HTTP GET /v1/files/<path>, and the response is {path, size, ...}. With output_url (a presigned PUT), the server uploads the encoded bytes to the URL and the response is {url, size, ...}. Both missing → ValueError; both set → ValueError.

Inline base64 audio responses are gone in v1.0.0: the audio_base64 / midi_base64 / image_base64 / video_base64 / zip_base64 fields no longer exist. Use list_jobs / get_job / cancel_job to manage long-running async work.
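
The exactly-one-of rule for output_path and output_url can be mirrored client-side so bad calls fail before they reach the server. This is a sketch of the rule as described, not the server's own validation code; the function name is hypothetical.

```python
def resolve_output(output_path=None, output_url=None):
    """Enforce the xor contract and report which delivery mode was chosen."""
    # The two comparisons are equal when both are None or both are set.
    if (output_path is None) == (output_url is None):
        raise ValueError("pass exactly one of output_path or output_url")
    if output_path is not None:
        return ("path", output_path)
    return ("url", output_url)
```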

Endpoint: http://localhost:8000/v1/mcp

Tools:

Tool	What it does
`list_engines`	List configured engines and whether they're loaded
`list_presets`	List curated server-side workflows (name + description)
`describe_preset`	Show full step list of a preset before running
`list_ops`	List the ~24 pipeline op slugs available in `run_pipeline_tool` / presets
`run_preset`	Run a curated preset against an input file
`run_pipeline_tool`	Run an ad-hoc `[{op, params}, …]` chain server-side
`generate_music`	Text-to-audio — `engine` = `stable-audio-open` / `musicgen-small` / `musicgen-medium` / `riffusion` / `audioldm2`; `prompt` required, optional `lyrics`, `duration_sec`, `seed`. MusicGen requires `AUDIOLLA_ENABLE_NONCOMMERCIAL=1`. AudioLDM 2 is CC-BY 4.0 — commercial-safe with no opt-in.
`separate`	Demucs stem separation — per-stem staging via `output_paths={stem:path}` xor per-stem PUT via `output_urls={stem:url}`
`master`	Reference mastering (matchering) or preset chain (pedalboard)
`analyze`	BPM, key, LUFS, spectral features via librosa
`beats`	Beat grid — BPM + timestamps; optional click-track audio
`onsets`	Note onset timestamps
`melody`	Dominant melody contour in Hz; optional MIDI export
`segments`	Structural segmentation — recurring section labels (A, B, C…)
`silence`	Detect silent gaps; optional auto-trim (edges or all)
`visualize`	PNG spectrogram/waveform or animated MP4/WebM — `engine` + `mode` select output type
`fingerprint`	Chromaprint acoustic fingerprint (AcoustID-compatible)
`restore`	Remove reverb/echo/noise via UVR — `engine` selects model; `aggressive=true` for harder echo suppression
`denoise`	Thin shim — prefer `restore` with `engine=uvr-denoise` or `noise_reduce` with `engine=uvr-denoise`
`audio_to_midi`	Polyphonic audio-to-MIDI transcription via basic-pitch (ONNX) — writes MIDI to `output_path` xor `output_url`
`enhance`	Neural speech and vocal enhancement via DeepFilterNet DF3
`chords`	Chord and key detection via librosa — key + per-segment chord labels
`vad`	Voice activity detection via silero-vad — speech/non-speech segments with timestamps
`diarize`	Speaker diarization via pyannote — per-speaker timestamped segments
`transform`	Sox DSP chain — gain, EQ, reverb, pitch, tempo, etc.
`loudness`	Measure integrated LUFS — returns JSON only
`loudness_curve`	RMS envelope over time — `{curve:[{time_sec,rms_db}],duration,sample_rate,points}`
`normalize`	Normalize audio to a target LUFS level — writes to `output_path` xor `output_url`
`hpss`	Harmonic/percussive separation — writes per-stem audio to `output_paths={stem:path}` xor `output_urls={stem:url}`
`noise_reduce`	Noise reduction — `engine=noise-reduce` (DSP, stationary/prop_decrease) or `engine=uvr-denoise` (ML)
`stretch`	Time-stretch + pitch-shift via librosa phase vocoder
`pitch_correct`	Auto-tune toward nearest chromatic semitone — `strength` [0.0–1.0]; requires `librosa-analyze`
`repair_audio`	Declip + dehum — `declip` bool, `dehum` bool, `hum_freq` Hz
`tag`	Audio tagging via AST — top-K AudioSet labels with confidence scores
`embed`	512-dim CLAP audio embedding; with `query_text` returns cosine similarity
`classify`	Zero-shot CLAP classification — cosine similarity against any list of text labels
`info`	Probe audio metadata — duration, sample_rate, channels, codec, bit_depth
`trim`	Cut audio to [start_sec, end_sec) — writes to `output_path` xor `output_url`
`mix`	Mix N tracks with per-track gain — `tracks` list of {file_path/url, gain_db}
`concat`	Stitch N audio files end-to-end in order — `files` list of {file_path/url}
`speed`	Change playback speed without pitch shift — `speed` float (0.1–10.0)
`convert`	Re-encode: format, sample_rate, channels in one call
`similar`	Cosine similarity between two audio files via CLAP — returns `{similarity, dim}`
`midi_quantize`	Snap MIDI note timings to a rhythmic grid — `grid_beats` in beats
`fade`	Fade-in/fade-out with configurable duration and curve shape
`reverse`	Flip audio backwards
`loop`	Repeat audio N times — `count` total plays
`bpm_match`	Detect BPM then stretch to `target_bpm` — returns source/target BPM + tempo_factor
`stereo_width`	M/S stereo width — `width=0` mono, `1` original, `>1` wider
`split`	Split into equal parts or on silence — MCP form deprecated in v1.0.0; use REST `POST /v1/audio/split` with `output_path` for per-segment staging
`pan`	Pan in the stereo field — `position` [-1.0–1.0]
`eq`	Parametric EQ — `bands` list of `{freq, gain_db, width_hz}`
`key_match`	Detect key then pitch-shift to `target_key` — returns source_key + semitones
`sidechain_duck`	Duck primary track on trigger — `threshold_db`, `ratio`, `attack_ms`, `release_ms`
`fx`	Generic pedalboard effects chain — full catalog, your order and params
`midi_compose`	JSON song spec → MIDI; writes to `output_path` xor `output_url`
`midi_inspect`	Read MIDI structure — tempo, tracks, channels, note counts
`midi_transform`	Transpose, quantize, tempo override, channel filter on an existing MIDI file
`midi_render`	MIDI → audio via fluidsynth + SoundFont
`midi_generate`	One-shot compose + render — spec in, audio out
`drum_pattern`	Step-sequencer JSON spec → GM drum MIDI; `pattern` object of voice arrays, `swing`, `steps`, `bars`
`chords_to_midi`	Chord progression detected from audio → MIDI file; `tempo_bpm`, `velocity`, `octave` params
`audio_metadata`	Read or write audio tags — pass `tags` dict to write, omit to read
`detect_clipping`	Report digital clipping — clipped, clip_count, clip_ratio, peak_db
`mid_side`	M/S encode (`mode=encode`) or decode (`mode=decode`) stereo audio
`slice_at_beats`	Slice audio at beat positions — writes zip archive to `output_path` xor `output_url`; response includes `beat_count`
`convolution_reverb`	Apply IR reverb — `ir_file_path`/`ir_file_url` + `wet_mix` [0.0–1.0]
`transient_shaper`	Attack/sustain shaping — `attack_gain_db`, `sustain_gain_db`
`multiband_compress`	N-band compressor — `crossovers_hz` list + `bands` list of per-band specs
`dj_prep`	BPM + key + Camelot wheel + LUFS in one call
`find_loop_point`	Find best seamless loop boundary — `{loop_start_sec,loop_end_sec,bars,score,tempo_bpm,candidates}`
`deess`	Split-band sibilance attenuation — `threshold_db`, `frequency_hz`, `ratio`
`stereo_field`	Stereo field analysis — correlation, width, balance_db, mono_compatible, mid/side levels
`audio_thumbnail`	Extract most energetic segment — `duration_sec`; writes to `output_path` xor `output_url`; response includes `start_sec`/`end_sec`
`midi_humanize`	Add timing + velocity jitter to MIDI — `timing_ms`, `velocity_pct`, optional `seed` for deterministic output
`list_jobs`	List async jobs; optional `status` filter
`get_job`	Poll one async job by `job_id`
`cancel_job`	Cancel a running job or remove a completed one
`list_files`	List staged files
`put_file`	Upload a file (base64) to the staging area
`get_file`	Read a staged file back (base64)
`delete_file`	Remove a staged file

Auth (AUDIOLLA_AUTH_TOKEN) covers /v1/mcp the same as the REST endpoints — pass the bearer token in the Authorization header.
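
For callers not using an MCP client library, a tool invocation is a JSON-RPC 2.0 `tools/call` message sent with the bearer token. The sketch below only builds the request. The `file_path` argument name is an assumption, and a real session also needs the MCP `initialize` handshake and any session headers the server issues, which a client library normally handles.

```python
import json
import urllib.request

MCP_URL = "http://localhost:8000/v1/mcp"


def mcp_tool_call(tool, arguments, token=None, request_id=1):
    """Build an unsent JSON-RPC `tools/call` request for the MCP endpoint."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    headers = {
        "Content-Type": "application/json",
        # HTTP-based MCP transports may answer with plain JSON or an event stream.
        "Accept": "application/json, text/event-stream",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical call to the `analyze` tool with an assumed argument name.
req = mcp_tool_call("analyze", {"file_path": "in/song.wav"}, token="s3cret")
```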

Configuration

Variable	Default	Description
`AUDIOLLA_DEVICE`	`auto`	`auto`, `cpu`, `cuda`, or `cuda:N`
`AUDIOLLA_ENGINES_FILE`	`/app/engines.json`	path to engines registry
`AUDIOLLA_PRESETS_DIR`	`/app/presets`	directory of `*.yaml` preset workflows loaded at startup
`AUDIOLLA_DATA_DIR`	`/data`	where models and staged files live
`AUDIOLLA_UVR_MODELS_DIR`	`<DATA_DIR>/uvr_models`	where UVR model files are cached
`AUDIOLLA_AUTH_TOKEN`	—	bearer token; empty means no auth
`HF_TOKEN` / `HUGGINGFACE_TOKEN`	—	HuggingFace access token. The entrypoint mirrors the two names so setting either works. Required for the gated engines: `pyannote` speaker diarization, `stable-audio-open`, `musicgen-small`, `musicgen-medium`. Accept each model's licence on huggingface.co before using.
`LOG_LEVEL`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR` / `CRITICAL` (case-insensitive; `WARN` aliased to `WARNING`). Controls every audiolla logger + uvicorn's loggers. Logs are line-delimited JSON — each record carries `ts` / `level` / `logger` / `file` / `line` / `func` / `msg` plus `service` / `version` / `pid` / `host` / `thread`. HTTP requests additionally carry `request_id` (honoured from inbound `X-Request-Id`, else generated and echoed on the response), `method`, `path`, `status`, `duration_ms`, `client_ip`, `user_agent`, `req_bytes`, `resp_bytes`.
`AUDIOLLA_ENABLED_ENGINES`	(all)	comma-separated slugs to allow; empty = all
`AUDIOLLA_PRELOAD`	—	comma-separated slugs to load at startup
`AUDIOLLA_ENGINE_TTL`	`600`	seconds idle before an engine is unloaded (`10m` also works)
`AUDIOLLA_SWEEPER_INTERVAL`	`60`	how often the idle sweeper checks, in seconds
`AUDIOLLA_MAX_UPLOAD_BYTES`	`209715200`	upload cap (200 MB) — also caps URL fetch body size
`AUDIOLLA_FETCH_MODE`	`disabled`	`disabled`, `allowlist`, or `denylist` — controls server-side fetching for file_url / output_url
`AUDIOLLA_FETCH_HOSTS`	(none)	comma-separated host patterns (`bucket.s3.amazonaws.com`, `*.s3.amazonaws.com`). Required when mode=allowlist.
`AUDIOLLA_FETCH_SCHEMES`	`https`	comma-separated schemes — `https`, `http` (http opt-in only)
`AUDIOLLA_FETCH_ALLOW_PRIVATE`	`false`	allow URLs that resolve to private / loopback / link-local IPs
`AUDIOLLA_FETCH_TIMEOUT`	`30`	hard timeout per fetch/upload, in seconds (also accepts `30s`, `1m`)
`AUDIOLLA_FETCH_MAX_REDIRECTS`	`5`	max redirects per fetch; each Location re-validated through the policy
`AUDIOLLA_JOB_TTL`	`3600`	Seconds a completed/failed/cancelled job stays in memory before being swept. Also accepts `1h`, `30m`.
`AUDIOLLA_JOB_MAX_CONCURRENT`	`8`	Maximum number of async jobs that can run simultaneously.
`AUDIOLLA_SOUNDFONT`	`/usr/share/sounds/sf2/FluidR3_GM.sf2` (prod images)	Default SoundFont path for `/v1/midi/render`. Override per request via `soundfont_path`.
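
Two value formats in the table are worth illustrating: durations that accept either bare seconds or a suffixed form (`600`, `30s`, `10m`, `1h`), and comma-separated host patterns with `*` wildcards. The sketch below shows one plausible reading of each. It is not the server's parser, and the real matcher may treat wildcards more strictly than `fnmatch` does.

```python
import fnmatch
import re

_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Turn '600', '30s', '10m', or '1h' into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def host_allowed(host, patterns_csv):
    """Check a hostname against patterns such as '*.s3.amazonaws.com'."""
    patterns = [p.strip().lower() for p in patterns_csv.split(",") if p.strip()]
    # Note: fnmatch's '*' also matches dots, so it spans several labels.
    return any(fnmatch.fnmatchcase(host.lower(), p) for p in patterns)
```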

What's not in here

Excluded	Why
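MusicGen / MAGNeT / JASCO	CC-BY-NC weights, so none is enabled by default. MusicGen (`musicgen-small`, `musicgen-medium`) is available only behind `AUDIOLLA_ENABLE_NONCOMMERCIAL=1`; MAGNeT and JASCO are not shipped. Outclassed by ACE-Step (Apache 2.0) and DiffRhythm (Apache 2.0), both shipping in the box as of v1.0.0.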
YuE 7B	Apache 2.0 but realistically needs 16-24 GB VRAM at fp16; doesn't fit comfortably on a 12 GB GPU without int4 quant tooling. Revisit when a 2B or quantised variant lands.
Essentia analysis	AGPL v3 — any network service using it has to publish full source. librosa handles the common cases without that.
Streaming separation	Demucs needs the whole file. No chunked or real-time inference.
VST3 plugin hosting	Pedalboard can do it but you'd need to mount your host plugin directory. Out of scope for the default image.
rubberband pitch/time-stretch	GPL v2 + commercial license. Sox handles basic pitch and tempo. Add it yourself if you accept the terms.

Build & dev

make build        # CPU image
make build-cuda   # CUDA image
make run          # CPU image on port 8000
make run-cuda     # CUDA image on port 8000

make dev-image          # build the dev container
make shell              # shell inside it
make lint               # flake8 + mypy
make format             # isort + black
make test-unit          # unit tests (no GPU, no ML deps needed)
make test-unit-cov-gate # fail if coverage on support modules drops below 80%
make test-integration   # integration tests (spins up Docker containers)
make generate           # regenerate src/audiolla/schema/ from openapi.yaml
make clean              # wipe build/cache artifacts

make pkg-lock                 # refresh uv.lock
make pkg-add PKG=name[==ver]  # add a dep
make pkg-update PKG=name      # upgrade one dep
make pkg-upgrade              # upgrade everything
make pkg-remove PKG=name      # remove a dep
make pkg-compile-heavy        # recompile requirements-heavy-{cpu,cuda}.txt

Every make pkg-* target first bumps [tool.uv] exclude-newer to UTC midnight 7 days before the bump date, and only then touches anything else, so packages published in the last week are invisible to the resolver. The 7-day floor covers the supply-chain attack window: malicious fresh wheels (typosquats, hijacked maintainer releases) typically get caught and yanked within hours to days, so every upload gets a week of community scrutiny before it is eligible to enter the lockfile. Everything runs inside the dev container; the host needs docker, make, and git.
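
The exclude-newer cutoff is plain date arithmetic. The function below is an illustrative sketch of that calculation, not the Makefile's actual implementation; it formats the result as an RFC 3339 timestamp, which uv accepts for `exclude-newer`.

```python
from datetime import datetime, timedelta, timezone


def exclude_newer_cutoff(now=None, days=7):
    """Return UTC midnight `days` days before `now` as an RFC 3339 string."""
    now = now or datetime.now(timezone.utc)
    # Normalize to UTC first, then truncate to the start of that day.
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (midnight - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")


# A bump run on the afternoon of 10 March 2025 (UTC).
cutoff = exclude_newer_cutoff(datetime(2025, 3, 10, 15, 30, tzinfo=timezone.utc))
```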

Supply chain

Both prod images do a two-layer install.

Light deps (fastapi, uvicorn, pydantic, etc.): locked in uv.lock, installed with uv sync --frozen --no-dev. Build fails if the lockfile doesn't match pyproject.toml. Wheel hashes verified by uv.

Heavy ML/DSP deps (torch, demucs, matchering, pedalboard, librosa, sox, numpy, soundfile, huggingface-hub): one hash-locked requirements file per image variant (requirements-heavy-cpu.txt, requirements-heavy-cuda.txt), because the torch wheel differs between CPU and CUDA and lives on a different index. Human specs in scripts/heavy-deps-{cpu,cuda}.in, compiled via make pkg-compile-heavy, installed with uv pip install --require-hashes. Both files are committed.

Base images and the uv binary are pinned by @sha256: digest.

License

WTFPL.

matchering and pedalboard are GPL v3. Fine for self-hosted use. Distributing the image as a product needs a GPL compliance review.
