Overview

mlx-vlm runs vision-language models natively on Apple Silicon using the MLX framework. It supports inference and fine-tuning with unified memory — no GPU server needed.

Repo: Blaizzy/mlx-vlm
Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.10+

Installation

bash

# Create virtual environment (recommended)
python3 -m venv ~/.venvs/mlx-vlm
source ~/.venvs/mlx-vlm/bin/activate

# Install
pip install mlx-vlm

For development:

bash

git clone https://github.com/Blaizzy/mlx-vlm.git
cd mlx-vlm && pip install -e .

Supported Models

Model	HuggingFace ID	Best For
Pixtral	`mistral-community/pixtral-12b-240910`	General vision, multi-image
Qwen2-VL	`Qwen/Qwen2-VL-7B-Instruct`	OCR, document understanding
Phi-3-Vision	`microsoft/Phi-3.5-vision-instruct`	Lightweight, fast inference
LLaVA-1.6	`llava-hf/llava-v1.6-mistral-7b-hf`	Conversation about images
Llama-3.2-Vision	`meta-llama/Llama-3.2-11B-Vision-Instruct`	Strong general reasoning

Inference

CLI

bash

# Single image analysis
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image path/to/image.jpg \
  --prompt "Describe this image in detail" \
  --max-tokens 512

# Multi-image comparison
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image img1.jpg img2.jpg \
  --prompt "Compare these two images"

Python API

python

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/pixtral-12b-240910-4bit"
model, processor = load(model_path)

prompt = apply_chat_template(
    processor,
    config=model.config,
    prompt="What objects are in this image?",
    images=["product.jpg"],
)

output = generate(
    model, processor, prompt,
    images=["product.jpg"],
    max_tokens=512,
    temperature=0.7,
)
print(output)

Batch Processing

python

import os, csv
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/pixtral-12b-240910-4bit")
image_dir = "images/"

results = []
for filename in os.listdir(image_dir):
    if not filename.lower().endswith((".jpg", ".png", ".webp")):
        continue
    path = os.path.join(image_dir, filename)
    prompt = apply_chat_template(
        processor, config=model.config,
        prompt="Describe this product photo. Include: category, color, condition, key features.",
        images=[path],
    )
    desc = generate(model, processor, prompt, images=[path], max_tokens=256)
    results.append({"file": filename, "description": desc})

with open("descriptions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "description"])
    writer.writeheader()
    writer.writerows(results)

Fine-Tuning

Prepare Dataset

Create JSONL with image paths and conversations:

json

{"image": "train/001.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Electronics, Subcategory: Headphones, Condition: New"}]}
{"image": "train/002.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Clothing, Subcategory: T-Shirt, Condition: Used - Good"}]}

Run Fine-Tuning (LoRA)

bash

python -m mlx_vlm.lora \
  --model mlx-community/pixtral-12b-240910-4bit \
  --data ./dataset \
  --train-file train.jsonl \
  --valid-file val.jsonl \
  --num-layers 8 \
  --batch-size 1 \
  --epochs 3 \
  --lr 1e-5 \
  --adapter-path ./adapters

Inference with Fine-Tuned Adapter

bash

python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --adapter-path ./adapters \
  --image test.jpg \
  --prompt "Classify this product"

Cloud API Comparison

Factor	mlx-vlm (Local)	Cloud APIs (GPT-4V, Claude)
Cost	$0 after hardware	$0.01-0.04 per image
Privacy	Data stays local	Data sent to provider
Speed	~2-8s per image (M3 Max)	~1-3s per image
Offline	Yes	No
Custom models	LoRA fine-tuning	Limited / expensive
Quality	Good (7-12B models)	Excellent (frontier models)

Performance Tips

Use 4-bit quantized models (4bit in name) for 2-3x speedup with minimal quality loss
M3 Max / M4 Pro with 36GB+ RAM can run 12B models comfortably
For M1/M2 with 16GB, stick to 7B 4-bit models
Set MLX_METAL_JIT=1 for potential speedup on first run
Close memory-heavy apps before inference — unified memory is shared with system

mlx-vlm

Validation

Getting Started

Example Prompts

Documentation

Overview

Installation

Supported Models

Inference

CLI

Python API

Batch Processing

Fine-Tuning

Prepare Dataset

Run Fine-Tuning (LoRA)

Inference with Fine-Tuned Adapter

Cloud API Comparison

Performance Tips

Information

Use Cases