SwiftMac: A Native macOS Speech Server for Emacspeak

Emacspeak turns Emacs into a complete audio desktop for blind and low-vision developers. It needs a speech server to convert text and audio cues into spoken output. SwiftMac implements this server natively in Swift, using macOS speech synthesis APIs directly.

The server receives commands via stdin, manages speech queues, and outputs audio through macOS AVSpeechSynthesizer. Async from the ground up. Fast. Responsive.

The Emacspeak Protocol

Speech servers receive short (mostly single-letter) commands, one per line:

  • q - Queue speech (add to buffer, don’t speak yet)
  • d - Dispatch queue (speak everything queued)
  • s - Stop all speech immediately
  • l - Instant letter (speak one character now)
  • a - Queue audio icon (earcon)
  • p - Play audio icon instantly
  • t - Queue tone
  • sh - Queue silence
  • c - Queue code (handle specially for programming)

Additional commands control voice parameters:

  • tts_set_speech_rate - Adjust speaking speed
  • tts_set_punctuations - Control which punctuation gets spoken
  • tts_split_caps - Split CamelCase words
  • tts_set_character_scale - Adjust pitch for character echo
  • tts_set_voice - Change voice
  • tts_pause / tts_resume - Pause and resume speech
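
As a concrete example of the rate command, a words-per-minute value can be mapped onto AVSpeechUtterance's normalized 0...1 rate scale. A minimal sketch — the wpm calibration range here is an assumption for illustration, not SwiftMac's actual values:

```swift
// Map an Emacspeak speech rate (words per minute) onto AVSpeechUtterance's
// normalized 0...1 rate scale. The 80-720 wpm range is assumed for
// illustration; SwiftMac's real calibration may differ.
func utteranceRate(wordsPerMinute: Double, minWPM: Double = 80, maxWPM: Double = 720) -> Float {
    let clamped = min(max(wordsPerMinute, minWPM), maxWPM)
    return Float((clamped - minWPM) / (maxWPM - minWPM))
}
```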

Example interaction:

stdin: q
stdin: This is some text
stdin: d
[speaks: "This is some text"]

stdin: l
stdin: x
[speaks: "x"]

stdin: s
[stops all speech immediately]

Architecture

SwiftMac is built around async Swift and AVSpeechSynthesizer:

// Simplified core structure
import AVFoundation

class SpeechServer {
    private let synthesizer = AVSpeechSynthesizer()
    private var speechQueue: [AVSpeechUtterance] = []
    private var currentVoice = AVSpeechSynthesisVoice(language: "en-US")

    func processCommand(_ command: String, _ text: String) async {
        switch command {
        case "q":
            queueSpeech(text)
        case "d":
            await dispatchQueue()
        case "s":
            stopAll()
        case "l":
            await speakInstant(text)
        // ... other commands
        default:
            break
        }
    }
}

The server runs an async loop reading stdin, parsing commands, and managing speech output concurrently.
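
That loop can be sketched as a simplified model, assuming the command-then-argument-line convention shown in the protocol examples (illustrative, not SwiftMac's actual code; the real server handles more commands and runs asynchronously):

```swift
// Commands that consume the following line as their argument
// (simplified subset, taken from the protocol examples above).
let commandsWithArgument: Set<String> = ["q", "l", "a", "p", "t", "c"]

func readCommands(_ lines: [String]) -> [(command: String, argument: String)] {
    var result: [(command: String, argument: String)] = []
    var index = 0
    while index < lines.count {
        let command = lines[index]
        index += 1
        if commandsWithArgument.contains(command), index < lines.count {
            // Command takes text: consume the next line as its argument
            result.append((command, lines[index]))
            index += 1
        } else {
            result.append((command, ""))
        }
    }
    return result
}
```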

Buffering and Voice Switching

SwiftMac writes output to a buffer and trims silent gaps. This enables voice changes within a single line without audible breaks.

Emacspeak uses different voices for semantic meaning - bold text sounds different from normal text, comments sound different from code. When a line contains multiple voice changes (like # bold code comment), SwiftMac needs to switch voices mid-line.

Without buffering, voice switches create gaps:

# bold → [gap] → code → [gap] → comment

The speech engine adds silence between separate utterances. Users hear awkward pauses that don’t reflect the actual text structure.

SwiftMac buffers output and trims silent gaps between voice segments. Voice switches happen without perceptible gaps:

# bold code comment
[speaks as continuous line with voice changes, no gaps]

This makes multi-voice output natural. Technical content with inline comments, code with syntax highlighting, and documentation with formatting all speak fluidly.
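
The buffering step can be pictured as a small coalescing pass over (voice, text) segments — a minimal sketch with assumed types, not SwiftMac's actual implementation:

```swift
import Foundation

// A (voice, text) pair for one span of a line (assumed type for illustration).
struct Segment: Equatable {
    var voice: String
    var text: String
}

// Drop empty spans and merge adjacent spans that share a voice, so the
// synthesizer never receives gap-producing empty utterances.
func coalesce(_ segments: [Segment]) -> [Segment] {
    var merged: [Segment] = []
    for segment in segments {
        let text = segment.text.trimmingCharacters(in: .whitespaces)
        if text.isEmpty { continue }
        if merged.last?.voice == segment.voice {
            merged[merged.count - 1].text += " " + text
        } else {
            merged.append(Segment(voice: segment.voice, text: text))
        }
    }
    return merged
}
```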

Voice Configuration

SwiftMac exposes macOS system voices through the Emacspeak voice system:

;; Configure voices in Emacs Lisp
(swiftmac-define-voice voice-bolden
  "[{voice en-US:Fred}] [[pitch 0.8]]")

(swiftmac-define-voice voice-animate
  "[{voice en-US:Kit}] [[pitch 1]]")

(swiftmac-define-voice voice-lighten
  "[{voice en-AU:Matilda}] [[pitch 1]]")

Each voice can have different pitch, rate, and volume. Emacspeak uses these to convey semantic information - bold text sounds different from normal text, comments sound different from code.
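
The bracketed voice strings above can be pulled apart with a couple of regular expressions. A hedged sketch of such a parser — the syntax is taken from the examples; this is not SwiftMac's actual parser:

```swift
import Foundation

// Extract the voice name and pitch from a spec like
// "[{voice en-US:Fred}] [[pitch 0.8]]".
func parseVoiceSpec(_ spec: String) -> (voice: String?, pitch: Double?) {
    var voice: String? = nil
    var pitch: Double? = nil
    if let range = spec.range(of: #"\{voice ([^}]+)\}"#, options: .regularExpression) {
        // Strip the "{voice " prefix (7 chars) and "}" suffix from the match
        voice = String(String(spec[range]).dropFirst(7).dropLast(1))
    }
    if let range = spec.range(of: #"\[\[pitch ([0-9.]+)\]\]"#, options: .regularExpression) {
        // Strip the "[[pitch " prefix (8 chars) and "]]" suffix from the match
        pitch = Double(String(String(spec[range]).dropFirst(8).dropLast(2)))
    }
    return (voice, pitch)
}
```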

Audio Icons and Tones

Beyond speech, SwiftMac handles audio cues:

import AppKit

// Audio icons ("earcons") provide non-speech feedback
func playAudioIcon(_ icon: String) {
    // Load and play a named sound file
    let sound = NSSound(named: icon)
    sound?.play()
}

// Tones provide quick auditory landmarks
func playTone(frequency: Double, duration: Double) {
    // Generate and play a sine wave
    // (generateTone and play are implementation details elided here)
    let tone = generateTone(frequency, duration)
    play(tone)
}

These auditory cues give immediate feedback without interrupting speech flow. Opening a file, switching buffers, completing a command - each has distinct audio feedback.
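
The tone generation behind generateTone reduces to sampling a sine wave. A minimal sketch — the sample rate and function name are assumptions for illustration:

```swift
import Foundation

// Generate `duration` seconds of a sine tone at the given frequency,
// as Float samples in [-1, 1].
func sineSamples(frequency: Double, duration: Double, sampleRate: Double = 44_100) -> [Float] {
    let count = Int(duration * sampleRate)
    return (0..<count).map { index in
        Float(sin(2.0 * .pi * frequency * Double(index) / sampleRate))
    }
}
```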

Volume Control

Three independent volume controls via environment variables:

export SWIFTMAC_VOICE_VOLUME="1.0"  # Speech volume
export SWIFTMAC_TONE_VOLUME="0.1"   # Tones quieter
export SWIFTMAC_SOUND_VOLUME="0.1"  # Audio icons quieter

Adjust the mix for your preferences. Speech at full volume, subtle audio cues in background.
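
Reading these variables amounts to an environment lookup with a default and a clamp. A sketch, assuming volumes clamp to the 0...1 range used by the audio APIs (the function name is illustrative):

```swift
import Foundation

// Look up a volume variable; fall back to a default, clamp to 0...1.
func volume(for name: String, fallback: Float) -> Float {
    guard let raw = ProcessInfo.processInfo.environment[name],
          let value = Float(raw) else { return fallback }
    return min(max(value, 0.0), 1.0)
}

// Example: let voiceVolume = volume(for: "SWIFTMAC_VOICE_VOLUME", fallback: 1.0)
```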

Design Philosophy

Three core principles:

1. Server should be as dumb as possible

Complex decisions belong in Emacs Lisp, not the speech server. SwiftMac provides primitives. Emacspeak orchestrates them.

2. Usable by default

Once it builds, it works. No configuration file. No setup wizard. Sensible defaults.

3. No secret dependencies

All dependencies are resolved at compile time. No hidden requirements for command-line tools that might not exist.

Building

make          # Debug build
make release  # Optimized build

Outputs to .build/debug/swiftmac or .build/release/swiftmac. Symlink to your emacspeak servers directory.

Concurrent Speech Processing

SwiftMac handles multiple concurrent streams:

// Queue management with async/await
actor SpeechQueue {
    private var queue: [SpeechItem] = []

    func enqueue(_ item: SpeechItem) {
        queue.append(item)
    }

    func dispatch() async {
        for item in queue {
            await speak(item)
        }
        queue.removeAll()
    }
}

Using Swift’s actor model ensures thread-safe queue operations without manual locking.
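
A runnable illustration of that guarantee: many tasks enqueue concurrently with no locks and no lost items. DemoQueue is a simplified stand-in, not SwiftMac's SpeechQueue:

```swift
// Concurrent enqueues are serialized by the actor; no manual locking.
actor DemoQueue {
    private var items: [String] = []

    func enqueue(_ item: String) {
        items.append(item)
    }

    func count() -> Int {
        items.count
    }
}

// Usage (in an async context):
// let queue = DemoQueue()
// await withTaskGroup(of: Void.self) { group in
//     for i in 0..<100 { group.addTask { await queue.enqueue("item \(i)") } }
// }
// let total = await queue.count()  // never loses an append
```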

Integration with Emacspeak

Emacspeak starts the server and communicates via pipes:

;; In emacspeak configuration
(setq dtk-program "swiftmac")

;; Emacspeak sends commands like:
;; q\nThis is text\nd\n
;; SwiftMac receives on stdin, speaks, continues

The protocol is text-based and simple. Easy to debug with shell scripts or manual testing.

Testing

# Test manually
echo -e "q\nHello world\nd" | ./swiftmac

# Test specific commands
echo -e "l\nA" | ./swiftmac  # Speak letter "A"

# Test voice switching
echo -e "tts_set_voice en-US:Fred\nq\nTesting Fred\nd" | ./swiftmac

Performance

Speech synthesis happens asynchronously. Emacspeak never blocks waiting for speech. Commands queue instantly. Dispatch happens in background.

On modern Macs (M1/M2), SwiftMac handles speech commands with sub-millisecond latency. Queue processing is limited by actual speech duration, not processing overhead.

Status and Future

SwiftMac is production-ready and actively maintained. It ships with emacspeak and gets regular updates.

Current work focuses on v2 architecture improvements and full protocol coverage. Some commands remain unimplemented (language switching), but core functionality is complete and stable.

Available at github.com/robertmeta/swiftmac under the GPL.