Apple’s AI is getting crazy and nobody’s talking about it
they just open-sourced FastVLM + MobileCLIP2: realtime VIDEO captioning straight from your phone camera.. runs 100% local on device, 85× faster and 3.4× smaller than comparable VLMs…
free to test and download, links below
let's break it down:
here’s the core tradeoff with vision-language models: bump up the image resolution and the model gets smarter. more pixels, more detail
but it also gets slower, ’cause higher-res images take longer to encode AND they produce more visual tokens, which bogs down the LLM
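quick back-of-napkin math for a plain ViT (nothing model-specific here, just the patch geometry):

```python
# for a standard ViT, visual token count grows quadratically with resolution:
# tokens = (resolution / patch_size) ** 2
def vit_token_count(resolution: int, patch_size: int = 14) -> int:
    return (resolution // patch_size) ** 2

for res in (224, 336, 672, 1024):
    print(f"{res}px -> {vit_token_count(res)} visual tokens")
# 224px -> 256 visual tokens
# 336px -> 576 visual tokens
# 672px -> 2304 visual tokens
# 1024px -> 5329 visual tokens
```

every one of those tokens lands in the LLM’s prefill, so doubling resolution roughly 4×’s the visual token bill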

they ran a controlled test: same data, same training recipe, same LLM, only the vision encoder gets swapped out
they compared standard ViTs (ViT‑L/14 and SigLIP‑SO400M), a fully convolutional ConvNeXt, and hybrid FastViT models
FastViT is like 8× smaller and 20× faster than ViT‑L/14 while staying just as smart
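you can sanity-check the size gap yourself. a rough sketch using timm’s off-the-shelf FastViT and ViT variants (these checkpoint names are my assumption, not Apple’s exact training checkpoints; see `timm.list_models("fastvit*")`):

```python
import timm

# compare parameter counts of a hybrid FastViT vs ViT-L/14
# (timm model names are assumptions; adjust to the variants you care about)
for name in ("fastvit_sa24", "vit_large_patch14_clip_224"):
    model = timm.create_model(name, pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M params")
```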

FastViT’s cool, but at high resolutions it starts to slow down
so they built FastViTHD: instead of naively scaling FastViT up, they added an extra downsampling stage, pretrained it with MobileCLIP, and ended up with fewer but better tokens
when they swept combinations of image resolution, LLM size (0.5B, 1.5B, 7B), and both encoders, FastViTHD came out way ahead
in some cases, it’s up to 3× faster for the same accuracy
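the “extra stage” trick is basically one more 2× downsample. assuming FastViT’s output stride is 32 and FastViTHD’s is 64 (my reading of the design), the token counts fall out like this:

```python
def conv_token_count(resolution: int, output_stride: int) -> int:
    # a hierarchical (conv/hybrid) encoder emits one token per
    # output_stride x output_stride patch of the input image
    return (resolution // output_stride) ** 2

for res in (512, 1024):
    fastvit = conv_token_count(res, output_stride=32)    # assumed stride
    fastvithd = conv_token_count(res, output_stride=64)  # assumed stride
    print(f"{res}px: FastViT {fastvit} tokens vs FastViTHD {fastvithd} tokens")
# 512px: FastViT 256 tokens vs FastViTHD 64 tokens
# 1024px: FastViT 1024 tokens vs FastViTHD 256 tokens
```

that 4× gap lines up with the token reduction mentioned below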

here’s where FastVLM comes in
they slap an MLP on top to project FastViTHD’s visual tokens into the LLM’s embedding space
the result: way fewer tokens (like 4× fewer than FastViT, 16× fewer than ViT‑L/14 at 336px). that’s a huge drop in token count and compute, while keeping things snappy
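the connector itself is the simple part. a minimal LLaVA-style sketch (the dims and activation here are illustrative assumptions, not FastVLM’s exact config):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """two-layer MLP mapping vision-encoder features into LLM embedding
    space (generic LLaVA-style pattern; dims are assumptions)"""
    def __init__(self, vision_dim: int = 3072, llm_dim: int = 896):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(visual_tokens)

# e.g. 16 visual tokens, ready to prepend to the text prompt
tokens = torch.randn(1, 16, 3072)
print(VisionProjector()(tokens).shape)  # torch.Size([1, 16, 896])
```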

What’s cool is they didn’t need the fancy token-pruning or merging tricks others use to speed things up. FastVLM naturally delivers better accuracy across token counts because it generates higher-quality tokens. simpler to deploy AND better results. win-win

there’s also this thing called dynamic tiling
your classic AnyRes approach where you chop up images into tiles and process them separately, then feed everything to the LLM
they tested that with FastVLM too. turns out FastVLM without tiling already gives a smoother accuracy-latency tradeoff
only at super high resolutions does tiling help much, and even then only with fewer tiles (like 2×2)
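for reference, the tiling step itself is dead simple. a generic 2×2 AnyRes-style crop (a sketch of the general idea, not Apple’s pipeline):

```python
from PIL import Image

def tile_image(img: Image.Image, grid: int = 2) -> list[Image.Image]:
    """split an image into grid x grid tiles; AnyRes-style pipelines
    encode each tile separately (generic sketch, not FastVLM's code)"""
    w, h = img.size
    tw, th = w // grid, h // grid
    return [
        img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
        for r in range(grid)
        for c in range(grid)
    ]

tiles = tile_image(Image.new("RGB", (1024, 1024)), grid=2)
print(len(tiles), tiles[0].size)  # 4 (512, 512)
```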

head to head, FastVLM beats popular VLMs of the same size in both speed and smarts
it’s 85× faster (time-to-first-token) than LLaVA‑OneVision (0.5B), 5.2× faster than SmolVLM (~0.5B), and 21× faster than Cambrian‑1 (7B)
that’s a ridiculous leap

they even released an iOS/macOS demo so you can actually try it on your iPhone GPU
that’s wild..
link to models:
- FastVLM: https://huggingface.co/collections/apple/fastvlm-68ac97b9cd5cacefdd04872e
- MobileCLIP2: https://huggingface.co/collections/apple/mobileclip2-68ac947dcb035c54bcd20c47
Demo (+ source code): https://huggingface.co/spaces/apple/fastvlm-webgpu
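if you’d rather poke at it from python than the app, the checkpoints are regular Hugging Face repos. something like this should be close (the repo id and loading pattern are my assumptions based on the standard remote-code flow; follow the model card for exact usage and image preprocessing):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# repo id is an assumption -- check the FastVLM collection for exact names
MODEL_ID = "apple/FastVLM-0.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # FastVLM ships custom modeling code
)
```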
if you found this inspiring,
follow @EHuanglu for more great stuff
and give it a like & repost to let more people know 👇