Voice and speak
Fulcrum can listen and speak. /voice records from your mic, transcribes via Whisper-Large-v3 on scx.ai, and drops the text into your prompt for review before submit. /speak reads assistant replies back through scx.ai's TTS catalogue. Both are off by default; toggle them on per message or persistently.
/voice — mic to prompt
Type /voice in the TUI and hit Enter. The recording starts inline above the prompt input — no fullscreen modal, the transcript and prompt stay fully visible.
What happens
- PortAudio (via sounddevice) opens the system default input. macOS may prompt for microphone permission on the first run; the indicator stays visible while the prompt is showing.
- Audio is buffered locally and POSTed to scx.ai's /audio/transcriptions endpoint with Whisper-Large-v3.
- The returned text drops into the prompt input. You can edit, prepend, or delete it before pressing Enter.
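The buffering step above can be sketched with the standard library alone. This is a minimal illustration, not Fulcrum's actual code: the function names, the 16 kHz mono 16-bit format, and the OpenAI-style form-field shape are all assumptions, and the real capture path goes through sounddevice rather than a pre-filled byte string.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM frames in a WAV container, entirely in memory.

    The sample rate and channel count are assumed values for illustration.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave does not close a file object it did not open, so the buffer is still readable.
    return buf.getvalue()


def transcription_fields(wav: bytes, model: str = "Whisper-Large-v3") -> dict:
    """Form fields for a multipart upload to /audio/transcriptions.

    Hypothetical shape, modelled on OpenAI-style transcription requests.
    """
    return {"model": model, "file": ("speech.wav", wav, "audio/wav")}
```

Keeping the audio in a BytesIO buffer means nothing touches disk on the way out; an HTTP client of your choice would send the fields as multipart form data.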
Review before submit
The transcribed text lands in the prompt input, not on the wire. This is intentional: Whisper makes errors, and the prompt is going to a coding agent that will read files, run shell commands, and edit your repo. You don't want stale or hallucinated input shipped without a glance.
Indicator UI
A one-line bar docks above the prompt input while you record: amber-on-dark when capturing, with elapsed time and a level meter. After you stop, the bar switches to a “processing on scx.ai…” animation while the transcription request is in flight, then disappears as the text appears in the prompt.
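As a rough illustration of the elapsed-time and level-meter readout, the sketch below computes an RMS level from 16-bit PCM and renders a one-line text bar. The function names, the `REC` label, and the meter characters are hypothetical; the actual bar's layout and colours are not specified here.

```python
import math
from array import array


def rms_level(pcm: bytes) -> float:
    """Return the RMS of 16-bit PCM samples (native byte order), scaled to 0.0-1.0."""
    samples = array("h")
    samples.frombytes(pcm)
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


def render_bar(elapsed_s: float, level: float, width: int = 10) -> str:
    """Render a one-line status bar: elapsed mm:ss plus a fixed-width level meter."""
    minutes, seconds = divmod(int(elapsed_s), 60)
    filled = round(max(0.0, min(1.0, level)) * width)  # clamp before scaling
    meter = "#" * filled + "-" * (width - filled)
    return f"REC {minutes:02d}:{seconds:02d} [{meter}]"
```

RMS is a common choice for a meter because it tracks perceived loudness better than peak amplitude, though a peak meter would work equally well for a "is the mic picking anything up" check.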
Stopping
While recording: Space stops capture and triggers transcription; Esc cancels without sending. Both are wired as priority bindings — they take effect even when the prompt input has focus.
Cross-platform notes
- macOS: Core Audio via PortAudio (preinstalled on stock macOS). The release workflow installs portaudio via Homebrew when building, but binary users get it for free.
- Linux: install libportaudio2 and portaudio19-dev (e.g. sudo apt-get install libportaudio2 portaudio19-dev). Without it, /voice shows a clean error and falls back to keyboard input.
- Windows: WASAPI via PortAudio is bundled with the wheel.
/speak — assistant to audio
Per-message
After a turn, type /speak on its own to read a short summary of the last assistant reply through your system audio. /speak some text here speaks the inline argument verbatim and skips summarisation. The MP3 is fetched from scx.ai's /audio/speech endpoint and handed to the platform's native player: afplay on macOS, paplay/aplay/ffplay on Linux, start on Windows.
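The per-platform player choice could look something like the sketch below. It is an assumption-laden illustration: `player_command` is a hypothetical name, the Linux players are tried in the order listed above, and the Windows branch goes through `cmd /c` because `start` is a cmd.exe builtin rather than a standalone executable.

```python
import shutil
import sys
from typing import Callable, Optional

# Tried in order on Linux; the first one found on PATH wins (assumed policy).
LINUX_PLAYERS = ("paplay", "aplay", "ffplay")


def player_command(
    path: str,
    platform: str = sys.platform,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Build the argv for playing an audio file with the platform's native player."""
    if platform == "darwin":
        return ["afplay", path]
    if platform.startswith("win"):
        # The empty string is the window title argument that `start` expects.
        return ["cmd", "/c", "start", "", path]
    for name in LINUX_PLAYERS:
        if which(name):
            return [name, path]
    raise RuntimeError("no audio player found on PATH")
```

Passing `platform` and `which` as parameters keeps the selection logic testable without touching the real system.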
Persistent auto-speak
/speak-on enables auto-speak: a one-line spoken summary of every terminal assistant reply, until disabled.
/speak-off disables it.
Voice selection
/voices opens a cursor picker over the catalogue (↑/↓ to navigate, Enter to pick, Esc to cancel). The choice persists to ~/.fulcrum/config.json as the tts_voice field — same shape as /models.
TTS model
tts-1 is the default. scx.ai exposes an OpenAI-compatible /audio/speech endpoint but with its own voice catalogue — these are not the OpenAI alloy/echo/nova names.
Playback
One clip plays at a time — issuing a new /speak terminates any in-flight playback before starting the next, so two clips never overlap. Replies above ~3500 characters are trimmed before synthesis to stay under the upstream per-request cap.
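One way to get the "one clip at a time" behaviour is to keep a handle on the current player process and terminate it before spawning the next. The sketch below assumes that approach; `SinglePlayer`, `trim_for_tts`, and the exact 3500-character limit are illustrative names and values, and a plain character cut stands in for whatever trimming rule Fulcrum actually applies.

```python
import subprocess
from typing import Optional

MAX_TTS_CHARS = 3500  # approximate cap described above; the exact value is an assumption


def trim_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> str:
    """Cut text to at most `limit` characters before sending it for synthesis."""
    return text if len(text) <= limit else text[:limit]


class SinglePlayer:
    """Run at most one playback process; starting a new clip stops the old one."""

    def __init__(self) -> None:
        self._proc: Optional[subprocess.Popen] = None

    def play(self, command: list[str]) -> subprocess.Popen:
        self.stop()  # terminate any in-flight playback first
        self._proc = subprocess.Popen(command)
        return self._proc

    def stop(self) -> None:
        proc, self._proc = self._proc, None
        if proc is not None and proc.poll() is None:
            proc.terminate()
            proc.wait()  # reap the process so it does not linger as a zombie
```

Calling `play` twice in a row leaves only the second process running, which is the property the paragraph above describes.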
Configuration
Three fields in ~/.fulcrum/config.json control the voice surface. Defaults are sensible — most users never touch them.
| Field | Default | Effect |
|---|---|---|
| transcription_model | Whisper-Large-v3 | Speech-to-text model used by /voice. |
| tts_model | tts-1 | Text-to-speech model used by /speak. |
| tts_voice | serene-assistant | Voice ID from the scx.ai catalogue (see below). |
Persist by editing ~/.fulcrum/config.json directly, or by setting the corresponding FULCRUM_* environment variable (e.g. FULCRUM_TTS_VOICE=alice-bennett) at shell startup. /voices does the same write for you, atomically, with mode 0600.
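The sketch below shows one plausible way to resolve these settings and to perform an atomic, mode-0600 write. Several things are assumptions: the precedence order (defaults, then file, then environment), the variable names other than FULCRUM_TTS_VOICE (derived by upper-casing the field name), and the helper names `resolve_config` and `save_field`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Mapping

DEFAULTS = {
    "transcription_model": "Whisper-Large-v3",
    "tts_model": "tts-1",
    "tts_voice": "serene-assistant",
}


def resolve_config(path: Path, env: Mapping[str, str] = os.environ) -> dict:
    """Merge defaults, then the config file, then FULCRUM_* variables (assumed order)."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    for key in DEFAULTS:
        value = env.get(f"FULCRUM_{key.upper()}")  # e.g. FULCRUM_TTS_VOICE
        if value:
            config[key] = value
    return config


def save_field(path: Path, key: str, value: str) -> None:
    """Update one field atomically: write a temp file beside the target, then rename."""
    current = json.loads(path.read_text()) if path.exists() else {}
    current[key] = value
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".config-")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(current, handle, indent=2)
        os.chmod(tmp_name, 0o600)  # owner read/write only
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing the temp file in the same directory as the target matters: `os.replace` is only atomic when both paths are on the same filesystem, so a reader sees either the old file or the new one, never a partial write.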
Voice catalogue
Six voices, hard-coded in the picker because scx.ai exposes no /voices endpoint as of this release. Pass the ID to tts_voice in config or pick visually with /voices.
| ID | Character |
|---|---|
| serene-assistant | neutral, calm; good default |
| alice-bennett | warm female, conversational |
| ito | soft, narrator-like |
| australian-sam | Australian male, casual |
| friendly-kiwi | New Zealand male, upbeat |
| likeable-aussie | Australian female, friendly |
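Because the catalogue is hard-coded, a client can validate a tts_voice value locally before making a request. The sketch below mirrors the table above as a plain dictionary; `VOICES` and `validate_voice` are hypothetical names, not Fulcrum's internals.

```python
# Voice IDs and descriptions copied from the table above.
VOICES = {
    "serene-assistant": "neutral, calm",
    "alice-bennett": "warm female, conversational",
    "ito": "soft, narrator-like",
    "australian-sam": "Australian male, casual",
    "friendly-kiwi": "New Zealand male, upbeat",
    "likeable-aussie": "Australian female, friendly",
}
DEFAULT_VOICE = "serene-assistant"


def validate_voice(voice_id: str) -> str:
    """Return voice_id unchanged if it is in the catalogue, else raise ValueError."""
    if voice_id not in VOICES:
        known = ", ".join(sorted(VOICES))
        raise ValueError(f"unknown voice {voice_id!r}; expected one of: {known}")
    return voice_id
```

A check like this would catch OpenAI voice names such as alloy early, with an error message that lists the valid IDs.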
Use cases
- Hands-free pair programming. /speak-on with an ambient mic is well-suited to conference calls where you want to watch the agent work and narrate decisions without typing every reaction.
- Documentation by dictation. /voice drafts long prompts faster than typing, which is useful for design rationales, bug repros, and post-mortems where you want to think aloud first and edit the transcript second.
- Accessibility. /voice plus /speak-on makes Fulcrum operable without keyboard or screen for long stretches: recordings start with a single command and replies come back spoken.
Why scx.ai TTS instead of OpenAI
The voice catalogue is curated for Australian English (australian-sam, likeable-aussie, friendly-kiwi), which is the structural sovereign-AI choice carried into voice. The same Australian Privacy Act jurisdiction that applies to text inference applies to speech: the audio bytes you send and receive stay in scx.ai's Australian region, not a US-routed third party.
Privacy note
Your voice data goes to scx.ai for transcription and TTS. It is not stored locally beyond the in-flight buffer (an in-memory WAV chunk on the way out, an MP3 written to $TMPDIR for the OS player on the way back). If you're discussing sensitive material (credentials, third-party customer data, anything covered by NDA), use the keyboard. The voice path is best-effort transcription, not a confidential channel.
Next steps
- Models → The scx.ai catalogue Fulcrum talks to: Whisper-Large-v3, tts-1, and the text-completion models the same gateway routes.
- Best practices → When to dictate, when to type, and how to keep auto-speak from becoming noise.