mlx-vlm

Run Vision Language Models locally on Apple Silicon Macs using MLX. Use when: installing mlx-vlm, running VLM inference (image + text → response), fine-tuning vision models on custom datasets, batch processing images with local AI, comparing local VLM to cloud APIs (GPT-4V, Claude Vision), or working with LLaVA, Phi-3-Vision, Qwen2-VL, Pixtral, Llama-3.2-Vision on Mac.

Tags: #mlx #vision #apple-silicon
terminal-skills v1.0.0
Works with: claude-code, openai-codex, gemini-cli, cursor

Usage

$
✓ Installed mlx-vlm v1.0.0

Getting Started

  1. Install the skill using the command above
  2. Open your AI coding agent (Claude Code, Codex, Gemini CLI, or Cursor)
  3. Reference the skill in your prompt
  4. The AI will use the skill's capabilities automatically

Information

Version: 1.0.0
Author: terminal-skills
Category: Data & AI
License: Apache-2.0

Documentation

Overview

mlx-vlm runs vision-language models natively on Apple Silicon using the MLX framework. It supports inference and LoRA fine-tuning in the Mac's unified memory, so no separate GPU server is needed.

Repo: Blaizzy/mlx-vlm
Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.10+

Installation

bash
# Create virtual environment (recommended)
python3 -m venv ~/.venvs/mlx-vlm
source ~/.venvs/mlx-vlm/bin/activate

# Install
pip install mlx-vlm

For development:

bash
git clone https://github.com/Blaizzy/mlx-vlm.git
cd mlx-vlm && pip install -e .
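
Before installing, it can help to confirm that the machine meets the requirements listed above. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, not part of mlx-vlm, and its thresholds simply mirror the Requirements line.

```python
import platform
import sys


def check_requirements(system, machine, macos_version, python_version):
    """Return a list of unmet requirements (an empty list means OK).

    Thresholds mirror the Requirements line: macOS 14+, Apple Silicon, Python 3.10+.
    """
    problems = []
    if system != "Darwin":
        problems.append("macOS is required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required")
    # macos_version looks like "14.5"; an empty string means it is unknown
    major = int(macos_version.split(".")[0]) if macos_version else 0
    if system == "Darwin" and major < 14:
        problems.append("macOS 14 or newer is required")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    return problems


if __name__ == "__main__":
    issues = check_requirements(
        platform.system(),
        platform.machine(),
        platform.mac_ver()[0],
        sys.version_info,
    )
    print("OK" if not issues else "\n".join(issues))
```

Note that an x86_64 Python running under Rosetta reports `x86_64` even on an Apple Silicon Mac, so this check also flags that setup.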

Supported Models

| Model | Hugging Face ID | Best For |
| --- | --- | --- |
| Pixtral | mistral-community/pixtral-12b-240910 | General vision, multi-image |
| Qwen2-VL | Qwen/Qwen2-VL-7B-Instruct | OCR, document understanding |
| Phi-3.5-Vision | microsoft/Phi-3.5-vision-instruct | Lightweight, fast inference |
| LLaVA-1.6 | llava-hf/llava-v1.6-mistral-7b-hf | Conversation about images |
| Llama-3.2-Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | Strong general reasoning |

Inference

CLI

bash
# Single image analysis
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image path/to/image.jpg \
  --prompt "Describe this image in detail" \
  --max-tokens 512

# Multi-image comparison
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image img1.jpg img2.jpg \
  --prompt "Compare these two images"
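
To script the CLI from Python, the same command can be assembled as an argument list. This sketch uses only the flags shown above; `build_generate_command` is a hypothetical helper, not part of mlx-vlm.

```python
import sys


def build_generate_command(model, images, prompt, max_tokens=None):
    """Build the argv list for `python -m mlx_vlm.generate`."""
    cmd = [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", model,
        "--image", *images,
        "--prompt", prompt,
    ]
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    return cmd


cmd = build_generate_command(
    "mlx-community/pixtral-12b-240910-4bit",
    ["img1.jpg", "img2.jpg"],
    "Compare these two images",
)
print(" ".join(cmd[1:]))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems when prompts contain spaces or quotes.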

Python API

python
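from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/pixtral-12b-240910-4bit"
model, processor = load(model_path)  # downloads from the Hugging Face Hub on first use
config = load_config(model_path)

images = ["product.jpg"]

# Wrap the question in the model's chat markup and add one image placeholder per image
prompt = apply_chat_template(
    processor,
    config,
    "What objects are in this image?",
    num_images=len(images),
)

# Images are passed positionally after the prompt. The sampling keyword is
# `temperature` in recent releases; some older releases used `temp`.
output = generate(
    model, processor, prompt, images,
    max_tokens=512,
    temperature=0.7,
)
print(output)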
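
The return type of `generate` is worth checking against the installed release. As an assumption to verify: some releases return a plain string, others a result object exposing a `text` attribute. A small, hypothetical normalizer keeps downstream code independent of that detail:

```python
def as_text(result):
    """Return the generated text whether `result` is a str or has a `.text` attribute."""
    if isinstance(result, str):
        return result
    text = getattr(result, "text", None)
    if isinstance(text, str):
        return text
    raise TypeError(f"unexpected generate() result: {type(result).__name__}")


class FakeResult:
    """Stand-in for a result object, used here only for demonstration."""

    def __init__(self, text):
        self.text = text


print(as_text("a red mug"))
print(as_text(FakeResult("a red mug")))
```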

Batch Processing

python
import csv
import os

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/pixtral-12b-240910-4bit"
model, processor = load(model_path)  # load once, reuse for every image
config = load_config(model_path)
image_dir = "images/"

# The prompt is identical for every image, so format it once
prompt = apply_chat_template(
    processor, config,
    "Describe this product photo. Include: category, color, condition, key features.",
    num_images=1,
)

results = []
for filename in sorted(os.listdir(image_dir)):  # sorted for a stable row order
    if not filename.lower().endswith((".jpg", ".jpeg", ".png", ".webp")):
        continue
    path = os.path.join(image_dir, filename)
    desc = generate(model, processor, prompt, [path], max_tokens=256)
    # Store plain text whether generate() returned a str or an object with .text
    results.append({"file": filename, "description": getattr(desc, "text", desc)})

with open("descriptions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "description"])
    writer.writeheader()
    writer.writerows(results)
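
For large folders, a batch run that stops halfway is expensive to repeat. The sketch below shows one way to make the loop resumable using only the standard library: it reads the file names already present in the CSV and returns the images still pending. `pending_images` is a hypothetical helper, and the `file` column name follows the script above.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def pending_images(image_dir, csv_path):
    """Return sorted image paths in image_dir not yet listed in the CSV's `file` column."""
    done = set()
    csv_file = Path(csv_path)
    if csv_file.exists():
        with csv_file.open(newline="", encoding="utf-8") as f:
            done = {row["file"] for row in csv.DictReader(f)}
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES and p.name not in done
    )
```

To benefit from this, open the CSV in append mode and write each row as soon as its description is generated, so an interrupted run loses at most one image.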

Fine-Tuning

Prepare Dataset

Create JSONL with image paths and conversations:

json
{"image": "train/001.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Electronics, Subcategory: Headphones, Condition: New"}]}
{"image": "train/002.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Clothing, Subcategory: T-Shirt, Condition: Used - Good"}]}
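
Malformed records tend to surface only after training has started, so a quick pre-flight check is worthwhile. This sketch validates the record shape shown in the two examples above (an `image` string plus `conversations` that alternate user and assistant turns). The stricter rules it enforces are assumptions for illustration, not requirements documented by mlx-vlm, and `validate_record` is a hypothetical helper.

```python
import json


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means valid)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["record is not a JSON object"]
    problems = []
    if not isinstance(rec.get("image"), str) or not rec.get("image"):
        problems.append("missing 'image' path")
    turns = rec.get("conversations")
    if not isinstance(turns, list) or not turns:
        problems.append("missing 'conversations'")
        return problems
    expected = ("user", "assistant")
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict) or turn.get("role") != expected[i % 2]:
            problems.append(f"turn {i}: expected role '{expected[i % 2]}'")
        elif not isinstance(turn.get("content"), str):
            problems.append(f"turn {i}: 'content' must be a string")
    if len(turns) % 2:
        problems.append("conversation does not end with an assistant turn")
    return problems
```

Running it over every line of the file with `enumerate(f, start=1)` gives line numbers for any problems it reports.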

Run Fine-Tuning (LoRA)

bash
# Check which options your installed release accepts with:
#   python -m mlx_vlm.lora --help
python -m mlx_vlm.lora \
  --model mlx-community/pixtral-12b-240910-4bit \
  --data ./dataset \
  --train-file train.jsonl \
  --valid-file val.jsonl \
  --num-layers 8 \
  --batch-size 1 \
  --epochs 3 \
  --lr 1e-5 \
  --adapter-path ./adapters
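
The command above expects separate `train.jsonl` and `val.jsonl` files. If the data starts as a single list of records, a seeded shuffle gives a reproducible split. A minimal sketch; `split_records` and the 10% validation share are illustrative choices, not mlx-vlm defaults.

```python
import random


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle a copy of records with a fixed seed and return (train, val)."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation record whenever there is any data
    n_val = max(1, round(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]


lines = [f'{{"image": "train/{i:03d}.jpg"}}' for i in range(1, 21)]
train, val = split_records(lines, val_fraction=0.1, seed=42)
print(len(train), len(val))
```

Fixing the seed means the same records land in the validation set on every run, which keeps loss curves comparable between experiments.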

Inference with Fine-Tuned Adapter

bash
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --adapter-path ./adapters \
  --image test.jpg \
  --prompt "Classify this product"
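
The training examples above use a `Key: value, Key: value` answer format, which makes the fine-tuned model's replies easy to turn into structured data. A sketch of such a parser follows; it is a hypothetical helper that assumes the model sticks to that format and that values contain no commas.

```python
def parse_classification(text):
    """Parse 'Category: X, Subcategory: Y, Condition: Z' into a dict."""
    fields = {}
    for part in text.split(","):
        key, sep, value = part.partition(":")
        if sep:  # skip fragments without a 'key: value' shape
            fields[key.strip().lower()] = value.strip()
    return fields


reply = "Category: Clothing, Subcategory: T-Shirt, Condition: Used - Good"
print(parse_classification(reply))
```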

Cloud API Comparison

| Factor | mlx-vlm (Local) | Cloud APIs (GPT-4V, Claude) |
| --- | --- | --- |
| Cost | $0 after hardware | $0.01-0.04 per image |
| Privacy | Data stays local | Data sent to provider |
| Speed | ~2-8s per image (M3 Max) | ~1-3s per image |
| Offline | Yes | No |
| Custom models | LoRA fine-tuning | Limited / expensive |
| Quality | Good (7-12B models) | Excellent (frontier models) |
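
The cost and speed rows can be turned into rough planning numbers. The sketch below computes how many images it takes for local hardware to pay for itself, and the hourly throughput implied by the local latency range. The $3,000 hardware price is a made-up example value; the per-image cost and latency figures come from the table.

```python
import math
from decimal import Decimal


def break_even_images(hardware_cost, cloud_cost_per_image):
    """Number of images after which local inference is cheaper than a cloud API.

    Arguments are decimal strings (e.g. "0.02") to avoid float rounding.
    Electricity and maintenance are ignored.
    """
    return math.ceil(Decimal(hardware_cost) / Decimal(cloud_cost_per_image))


def images_per_hour(seconds_per_image):
    """Sequential throughput for a given per-image latency."""
    return 3600 / seconds_per_image


print(break_even_images("3000", "0.01"))  # cheapest cloud price in the table
print(break_even_images("3000", "0.04"))  # most expensive cloud price in the table
print(images_per_hour(2), images_per_hour(8))
```

Under these assumptions the break-even point falls between 75,000 and 300,000 images, and a single machine processes roughly 450 to 1,800 images per hour.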

Performance Tips

  • Use 4-bit quantized models (4bit in name) for 2-3x speedup with minimal quality loss
  • M3 Max / M4 Pro with 36GB+ RAM can run 12B models comfortably
  • For M1/M2 with 16GB, stick to 7B 4-bit models
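  • MLX_METAL_JIT is a build-time (CMake) option of MLX that compiles Metal kernels at runtime; it is not an inference speedup switch, and JIT builds pay a one-time kernel compilation cost on first run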
  • Close memory-heavy apps before inference — unified memory is shared with system
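
The RAM guidance above follows from simple arithmetic: weight memory is roughly the parameter count times the bits per weight. A back-of-the-envelope sketch (it ignores quantization scales, the KV cache, activations, and whatever the system itself needs, so treat the results as lower bounds):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


for name, params in [("7B", 7), ("12B", 12)]:
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: fp16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

A 7B model at 16-bit precision needs about 14 GB for weights alone, which is why 16GB machines are better served by the 4-bit variants at about 3.5 GB.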