An experimental project exploring how voice models combined with fast, advanced LLMs can replace traditional customer service. The goal is to create AI agents that understand context better and answer customer questions more accurately than human representatives.
- LLM Understanding: Google Gemini 3 Flash (`gemini-3-flash-preview`) processes customer queries with context awareness
- Text-to-Speech: OpenAI `gpt-4o-mini-tts` or ElevenLabs converts responses to natural speech
- Customer Service Mode: the `--gemini-query` flag pipes input through the LLM before voice synthesis
- Multi-format Input: Supports Markdown, TXT, PDF, and DOCX files

- Install dependencies: `python3 -m pip install -r requirements.txt`. On Linux you may need `python3-tk` for the GUI picker.
- Copy and configure environment: `cp .env.local.example .env.local`
- Edit `.env.local` with your API keys:
  - `GEMINI_API_KEY` - For LLM understanding
  - `OPENAI_API_KEY` - For OpenAI TTS
  - `ELEVENLABS_API` - For ElevenLabs TTS

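
To illustrate the configuration step, here is a minimal sketch that parses `.env.local`-style lines and reports which of the three keys are still empty. The helpers `parse_env_lines` and `missing_keys` are hypothetical names, and this is not necessarily how `tts.py` loads its configuration. Which keys you actually need depends on the providers you use.

```python
# Key names taken from the setup step; whether each one is needed
# depends on which provider is in use (an assumption of this sketch).
EXPECTED_KEYS = ("GEMINI_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API")


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not values.get(key)]
```

For example, `missing_keys(parse_env_lines(open(".env.local", encoding="utf-8")))` lists the keys that still need a value.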
Process a customer query through Gemini, then convert the response to speech:

`python3 tts.py --gemini-query --text "How do I reset my password?"`

Convert a text string directly to speech:

`python3 tts.py --text "Your order has been shipped" --output notification.mp3`

Convert a text file to WAV:

`python3 tts.py --input-file response.txt --output ./dist/response.wav --format wav`

Test API connections:

`python3 tts.py --test`

| Option | Description |
|---|---|
| `--text` | Text string to convert to speech |
| `--input-file` | Path to a text file (MD/TXT/PDF/DOCX) |
| `--output` | Output file path (defaults to `~/Downloads/tts-output.mp3`) |
| `--format` | Audio format: `mp3` or `wav` (default: `mp3`) |
| `--provider` | TTS provider: `openai` or `elevenlabs` |
| `--gemini-query` | Process input through Gemini LLM first (customer service mode) |
| `--gemini-api-key` | Provide Gemini API key directly |
| `--voice` | Override the default voice |
| `--model` | Override the model name |
| `--api-key` | Provide TTS API key directly |
| `--project` | Project ID for OpenAI or ElevenLabs |
| `--choose-file` | Open GUI file picker |
| `--chunk-size` | Override chunking threshold (default: 3400 chars) |
| `--test` | Test API connections |

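
The `--chunk-size` option implies that long input is split into pieces before synthesis. A minimal sketch of one way to do this, preferring sentence boundaries and falling back to a hard cut for overlong sentences, might look like the following. `chunk_text` is a hypothetical helper, and the splitting logic in `tts.py` may well differ.

```python
import re


def chunk_text(text, max_chars=3400):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent to the TTS provider separately and the resulting audio concatenated.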

Gemini environment variables:

| Variable | Description |
|---|---|
| `GEMINI_API_KEY` | Google AI API key |
| `GEMINI_MODEL` | Model ID (default: `gemini-3-flash-preview`) |


OpenAI environment variables:

| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key (required for OpenAI TTS) |
| `OPENAI_MODEL` | TTS model (default: `gpt-4o-mini-tts`) |
| `OPENAI_VOICE` | Voice ID (default: `alloy`) |
| `OPENAI_PROJECT` | Project ID for project-scoped keys |


ElevenLabs environment variables:

| Variable | Description |
|---|---|
| `ELEVENLABS_API` | ElevenLabs API key |
| `ELEVENLABS_MODEL` | Model (default: `eleven_multilingual_v2`) |
| `ELEVENLABS_VOICE` | Voice ID |
| `ELEVENLABS_STABILITY` | Voice stability (0-1) |
| `ELEVENLABS_SIMILARITY` | Similarity boost (0-1) |
| `ELEVENLABS_STYLE` | Style (0-1) |
| `ELEVENLABS_SPEAKER_BOOST` | Speaker boost (true/false) |

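
As a rough sketch of how the numeric ElevenLabs variables could be read and range-checked, the hypothetical helper below collects only the values that are set and rejects anything outside 0-1. The output field names (`stability`, `similarity_boost`, `style`, `use_speaker_boost`) are an assumption about how the settings are passed on to ElevenLabs, and the project's own parsing may differ.

```python
import os

# Environment variable -> output field name (field names are assumed).
FLOAT_VARS = {
    "ELEVENLABS_STABILITY": "stability",
    "ELEVENLABS_SIMILARITY": "similarity_boost",
    "ELEVENLABS_STYLE": "style",
}


def load_voice_settings(env=None):
    """Collect the ElevenLabs voice settings that are set, validating ranges."""
    env = os.environ if env is None else env
    settings = {}
    for var, field in FLOAT_VARS.items():
        raw = env.get(var)
        if raw is None or raw == "":
            continue  # unset: leave the provider default in place
        value = float(raw)  # raises ValueError on non-numeric input
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{var} must be between 0 and 1, got {value}")
        settings[field] = value
    boost = env.get("ELEVENLABS_SPEAKER_BOOST")
    if boost:
        settings["use_speaker_boost"] = boost.strip().lower() == "true"
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.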
- Input: Customer query via text, file, or GUI picker
- LLM Processing (optional): Gemini 3 Flash analyzes the query and generates a helpful response
- Voice Synthesis: OpenAI or ElevenLabs converts the response to natural speech
- Output: MP3/WAV audio file ready for playback
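
The flow above can be sketched as a single function in which the LLM and TTS steps are passed in as callables, so no particular provider API is assumed. `run_pipeline`, `llm`, and `tts` are hypothetical names used only for illustration and do not describe the internals of `tts.py`.

```python
from pathlib import Path
from typing import Callable, Optional


def run_pipeline(
    query: str,
    tts: Callable[[str], bytes],
    output: Path,
    llm: Optional[Callable[[str], str]] = None,
) -> Path:
    """Input -> optional LLM step -> voice synthesis -> audio file."""
    # LLM processing is optional; without it the input is spoken as-is.
    text = llm(query) if llm is not None else query
    audio = tts(text)
    # Create the output directory if needed, then write the audio bytes.
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_bytes(audio)
    return output
```

In a real run, `llm` would wrap the Gemini call and `tts` the OpenAI or ElevenLabs call. Stub functions can be swapped in to test the flow offline.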
The system uses a customer service-optimized prompt that instructs the LLM to be helpful, accurate, and empathetic. The TTS voice is configured for warm, clear delivery suitable for customer interactions.
- `*.md` / `*.txt` - Read as UTF-8 (Markdown headings stripped for clean narration)
- `*.pdf` - Requires `PyPDF2`
- `*.docx` - Requires `python-docx`
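
A minimal sketch of dispatching on file extension is shown below. It assumes that "headings stripped" means removing the leading `#` markers while keeping the heading text, and that a recent PyPDF2 release providing `PdfReader` is installed. `read_input` and `strip_markdown_headings` are hypothetical names, not the project's own functions.

```python
import re
from pathlib import Path


def strip_markdown_headings(text):
    """Remove leading '#' markers so headings read as plain sentences."""
    return re.sub(r"^ {0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)


def read_input(path):
    """Return the plain text of a TXT, MD, PDF, or DOCX file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".md":
        return strip_markdown_headings(path.read_text(encoding="utf-8"))
    if suffix == ".pdf":
        from PyPDF2 import PdfReader  # third-party, imported only when needed

        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        from docx import Document  # provided by the python-docx package

        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    raise ValueError(f"Unsupported input format: {suffix}")
```

Importing the PDF and DOCX libraries lazily means plain-text input works even when those optional packages are not installed.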