Local AI Implementation
TinkerPilot is designed to run entirely on your local machine, without relying on any cloud-based AI services. This is achieved by using a combination of local AI engines and carefully selected models that offer a good balance of performance and resource usage.
AI Engines
TinkerPilot uses three main AI engines for its features:
- Ollama: Ollama is used for running the Large Language Model (LLM) and the text embedding model. It provides a simple and efficient way to run state-of-the-art models on your local hardware, with automatic hardware acceleration (Apple Metal on macOS, CUDA on Linux).
- Moonshine Voice: Moonshine Voice is a lightweight and fast speech-to-text (STT) engine that runs locally. It is used for transcribing meetings and voice notes.
- Kokoro: Kokoro is a text-to-speech (TTS) engine that generates natural-sounding speech from text. It is used for the "speak" command in the CLI.
AI Models
The following models are used in TinkerPilot:
| Model | Purpose | Size | Engine |
|---|---|---|---|
| Qwen2.5-3B-Instruct | Chat, summarization, code analysis | ~2.0 GB | Ollama |
| Qwen3-Embedding 0.6B | Text embeddings for RAG | ~639 MB | Ollama |
| Moonshine Voice | Speech-to-text (streaming) | ~250 MB | Moonshine (ONNX) |
| Kokoro-82M | Text-to-speech (6 voices) | ~82 MB | PyTorch |
RAG Pipeline
The "Chat with Documents" feature is powered by a Retrieval-Augmented Generation (RAG) pipeline. The implementation can be found in backend/app/core/rag.py.
The RAG pipeline consists of two main processes: ingestion and querying.
Ingestion
The ingestion process involves the following steps:
- Parsing: The input file is parsed to extract its text content and metadata. The parser supports a wide range of file types, including PDF, Markdown, Python, and more.
- Chunking: The extracted text is split into smaller, overlapping chunks. This is done to ensure that the context provided to the LLM is focused and relevant.
- Embedding: Each chunk is converted into a vector embedding using the text embedding model (Qwen3-Embedding 0.6B).
- Storage: The embeddings are stored in a ChromaDB vector database. The metadata for the ingested documents is also stored in the SQLite database.
Querying
When you ask a question in the chat, the following steps are performed:
- Embedding: Your question is converted into a vector embedding using the same embedding model.
- Retrieval: The vector database is searched to find the most similar chunks to your question's embedding.
- Generation: The retrieved chunks are passed to the LLM as context, along with your original question. The LLM then generates an answer based on the provided context.
- Streaming: The answer is streamed back to the frontend in real-time, providing a responsive user experience.
Configuration
You can customize the AI models and other settings in the ~/.tinkerpilot/config.yaml file. For example, you can change the LLM model used for chat, the embedding model, and the size of the speech-to-text model.
hf_token: "hf_your_token_here..." # Set this to disable unauthenticated HF warnings
llm:
model_name: "qwen2.5:3b" # any model from: ollama list
temperature: 0.7
embedding:
model_name: "qwen3-embedding:0.6b" # or nomic-embed-text, mxbai-embed-large
stt:
model_size: small # tiny, small, medium
language: en
tts:
voice: "af_heart" # Kokoro voice (e.g., af_heart, am_adam, af_bella)
speed: 1.0
lang_code: "a" # a=American English, b=British
rag:
chunk_size: 512
top_k: 5
integrations:
obsidian_vault_path: ~/Documents/ObsidianVault
enable_apple_notes: true
This allows you to tailor the AI's performance and resource usage to your specific needs and hardware.