Gemma 4 12B: A unified, encoder-free multimodal model

(blog.google)

250 points | by rvz 1 hour ago

19 comments

minimaxir 1 hour ago
The big story here is the encoder-free part, which I still don't fully understand.
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates, it's still a 35M layer which I am curious is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
[-]
- spott 18 minutes ago
  This is just early fusion basically.
  FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818
  I've been waiting for something like this to be released since then.
  The annoying thing is that chameleon was multi-modal out based on the same principles, but this model is just inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).
- georgehm 40 minutes ago
  Embedded within that developer page is a good explainer of the encoder free architecture . https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
- mchinen 26 minutes ago
  The audio side is even more interesting, as it seems they totally got rid of positional embedding are just doing a single linear transform to match the LLM input dimension and that's it.
  > Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
  [-]
  - make3 13 minutes ago
    I guarantee you there's positional information one way or another. they just don't mention it because positional embeddings are extremely cheap computationally, not worth mentioning
    [-]
    - neosat 7 minutes ago
      Agree. Audio has strongly temporal so there is almost certainly some positional encoding one way or another.
- jszymborski 1 hour ago
  Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
  [-]
  - minimaxir 1 hour ago
    In hindsight I may have been pedantic.
    [-]
    - wilkystyle 47 minutes ago
      I had a similar thought to you, and found your question and the resulting discussion helpful!
    - alberto467 43 minutes ago
      Not at all. Getting really pedantic, tokenization is also a form of encoding, so it doesn't matter the modality you're using, you'll end up doing some type of encoding in some way.
- kristjansson 1 hour ago
  > quantization
  12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?
  But TBD how well the base model performs before thinking too much about quantization
- matja 48 minutes ago
  One side-effect, is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed, when using the model with llama.cpp etc.
  [-]
  - pferdone 17 minutes ago
    But do I have the option to run it 'text only'?
- woadwarrior01 12 minutes ago
  There are many priors to encoder-free VLMs. I specifically remember the EVE series of models from ~2 years.
  https://github.com/baaivision/EVE
- rao-v 26 minutes ago
  Encoder free is huge for running on SBCs etc. often the encoding time is a significant fraction of generation time if you are using a VLM as a all purpose vision model
- reactordev 1 hour ago
  It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in their own ways so YMMV but overall, it’s about as solid as Qwen just with a more advanced architecture.
- wolttam 1 hour ago
  I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
- LarsDu88 1 hour ago
  Well its a real simple encoder I guess
- GaggiX 1 hour ago
  > That's technically encoding
  Isn't that just projecting the patches into the d_model size vectors that the models takes?
  >I am assuming that involves of quantization
  12B model in 16GB seems very reasonable to me, int8 is top quality for running models.
  [-]
  - minimaxir 1 hour ago
    The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."
    12B at int8 would take up 12G memory, or 75% of the system memory which technically fits within 16GB but the OS will not like that.
- fushigokira 1 hour ago
  [dead]
ethanpil 1 hour ago
What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?
Is it simply goodwill and/or marketing? Or am I missing something strategic?
[-]
- browningstreet 1 hour ago
  This won't replace commercially viable, revenue generating alternatives of their own devising, but it does enable development activity and initiate conversations with enterprises who start with this model but want to do slightly more.
  That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There's a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.
- Mr_P 58 minutes ago
  Android and Chrome need on-device AI capabilities. Google can't lock down those weights like it can with server-side ML.
  So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.
  [-]
  - Aachen 47 minutes ago
    Could say the same for camera processing in the Pixel Camera app or any other binary someone wants to re-use that comes included in a software distribution (seemingly for 'free'). They can't lock the instructions up on the server so they might as well make the binary be freely distributable?
    Companies don't commonly give away executable binaries "just because", why'd they start now for these binary blobs that are the models?
    Not that I'm unhappy about it! Yay for open data any day, I'm just not understanding why, at least beyond PR in nerd circles
    [-]
    - jack_pp 28 minutes ago
      Because a model like this can't be as easily obfuscated as image processing. Image processing is a bundle of many moving parts, a lot of functions each with it's own inputs and outputs. A model is a single function which can be easily extracted and reused, in comparison
- gen220 31 minutes ago
  A big part of the frontier labs abilities to charge 80% gross margins on inference is having the cornered resource of frontier models.
  If that inference becomes popular and valuable enough that those companies make billions of dollars in profit, those companies could use that profit to fund the building of alternative products and platforms that dis-intermediate google's relationship with the customer.
  Google already has an 80% gross margin business, the biggest one in the world. Everybody wants a slice of it.
  By offering frontier inference closer to cost and open-sourcing everything that's sub-frontier, they're commoditizing frontier labs' models, which inhibits their ability to durably make high gross margins on inference.
  It's a strategic play.
  [-]
  - zozbot234 28 minutes ago
    A 12B-sized model is a far cry from "frontier inference". That's more like DeepSeek V4 Pro territory which is a 1.6T model. Or for multi-modal models, Kimi 2.6 which is 1T.
    [-]
    - gen220 22 minutes ago
      at risk of quoting myself... :)
      > By offering frontier inference closer to cost *and* open-sourcing everything that's sub-frontier
      It's two prongs! One prong is that their frontier inference pricing is significantly cheaper/closer-to-at-cost as Anthropic's.
      The subject of this thread is the other prong: offering compelling models that are sub-frontier and self-hostable.
      Self-hosting models and at-cost frontier models are the high-end and low-end disruptions, respectively, to Ant/OAI/etc.'s business models.
      [-]
      - echelon 9 minutes ago
        Google needs an anti-trust breakup about 10 years ago.
        They need one more than ever now.
        This is ridiculously anti-competitive.
    - boutell 21 minutes ago
      You're right that it's not literally frontier. But like recent Qwen releases, it is a lot more capable than anybody thought models of this size could be a year ago, like capable enough to set a ceiling on what you can charge for AI for certain applications. Others still clearly justify a stronger model, but this trend may continue, etc.
- beambot 47 minutes ago
  Google is one of the few verticalized options in AI: Data, models, cloud services, low-level silicon (TPUs), internal use cases, retail use cases, B2B uses, distribution (browser & mobile), etc.
  They rise with the tide of AI adoption. But they gain ground if people opt into Google solutions. And any token sent to a Google model (free or paid) actively punishes their competitors that are then required to spend vast sums to remain bleeding edge.
- staticman2 31 minutes ago
  As long as Chinese firms are releasing good open models I imagine there isn't a huge downside for Google to release state of the art small models to compete in the "free" space.
- onlyrealcuzzo 1 hour ago
  If you're an AI lab, you definitely want research teams in this space - as this is where you can most easily iterate and make improvements which you'll then bake into larger, frontier models.
  The question is: do you want to release your models, or use them purely for R&D?
  Since everyone else is already releasing models of similar qualities, it's hard to say you're shooting yourself in the foot if you join the chorus.
  The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.
- rootusrootus 58 minutes ago
  Neutering OpenAI and Anthropic would be my guess. Commoditized LLMs won't hurt Google nearly as much as it hurts the LLM-only companies, and so accelerating the inevitable just helps knock out potential future competition in areas where Google -does- make a lot of money now.
  [-]
  - literalAardvark 22 minutes ago
    I think this plays a part, but the truth is that Google doesn't need to do that, Chinese open models are already doing that by themselves.
    So perhaps another part is just Google showing that they can indeed play at the big boys table.
- estearum 1 hour ago
  It's to destroy possible footholds for competitors and prevent them from making money in segments that Google doesn't care too much about, but can trivially commoditize.
- theturtletalks 1 hour ago
  Maybe they are hedging against a future where local models are just as good as cloud models? Or maybe they can go the Taalas route and start hardcoding Gemma on a chip and hardware manufacturers can use it for local private AI.
- ppeetteerr 1 hour ago
  Isn't Apple about to license some variation of this from google for on-device AI? Maybe it’s their sales pitch to Apple and then they will lock it down.
- CuriouslyC 1 hour ago
  They're trying to capture the segment of the market that wants to control the model, with the intent of getting you to run them on Vertex.
- stevenhubertron 50 minutes ago
  My guess is testing for Apple’s Siri replacement and partnership but that’s a total SWAG
- XzAeRosho 1 hour ago
  Google's MO since always has been to release great products or services for free, position themselves high and then abandon them or just find uses for Enterprise sales.
  I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.
  [-]
  - Aachen 43 minutes ago
    Google's "free" is and was ad-supported, even if some products now have a paid tier. These models don't include ads. Doesn't seem like the same underlying reason
- mmarian 1 hour ago
  Marketing + Pro Serv if I had to take a guess.
- dist-epoch 38 minutes ago
  Evangelism for AI. Google is one of the big AI providers.
  Eventually the local model is not enough, and you'll upgrade to the big ones.
- accountrequired 1 hour ago
  edge compute
- superchicken099 1 hour ago
  Gemma overtakes and kills real open-source AI projects, pushing people who would support them towards enterprises like Google
spott 11 minutes ago
Is there a paper on this?
I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.
I wonder how hard it would be to add it back on.
[-]
- joaogui1 6 minutes ago
  I mean Claude is multimodal on input but not output, why couldn't this also be?
ComputerGuru 38 minutes ago
Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!
A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.
lxgr 36 minutes ago
Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?
[-]
- philipkglass 34 minutes ago
  Since ollama has diverged from llama.cpp, it will take a bit of time for ollama to support multi-modality. If you're using plain llama.cpp it looks like a PR has already merged for this model with vision and audio support:
  https://github.com/ggml-org/llama.cpp/pull/24077
  [-]
  - zozbot234 14 minutes ago
    They've actually gone back to (a lightly patched) llama.cpp with the 0.30 release a few weeks ago, and have now vendored-in an up to date release. Needless to say this is great news for both projects!
mlmonkey 13 minutes ago
Is there some place where we can try it before downloading the gigabytes of weights?
Zambyte 57 minutes ago
Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.
[0] https://ollama.com/library/gemma4/tags
Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.
[-]
- embedding-shape 51 minutes ago
  MLX is quite literally macOS-specific technology, for other platforms you want non-MLX.
  I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.
  Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4") which was uploaded 45 minutes ago, especially if you're with a recent nvidia GPU.
  [-]
  - sambaumann 26 minutes ago
    I still get "this model requires macOS" when trying to pull that one
    [-]
    - embedding-shape 20 minutes ago
      I don't use Ollama myself anymore, but seems others been having similar issues for quite some time, maybe one of these fit your environment exactly? https://github.com/ollama/ollama/issues?q=is%3Aissue%20state...
- jw1224 48 minutes ago
  MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/
- jasonjmcghee 28 minutes ago
  There's a CUDA backend for MLX now. Not sure about the maturity.
dwa3592 1 hour ago
This is a pretty good update. The demo video is a bit funny though - the tester asks to turn the release into bullet points. okay, the model obliges. then the tester says draft an email with this content. BAM! the LLM turns the content from bullets to passages even though it was not asked and it undid the last good thing that it did. i am not sure if it's an email etiquette to not put bullets in the email.
Havoc 42 minutes ago
Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific ram constrained applications that can’t fit a quantized MoE
[-]
- dist-epoch 36 minutes ago
  The un-quantized MoE outperforms it.
  But between same (V)RAM requirement 4 bit 26B-A3B and 8 bit 12B it's unclear which one will win, especially given one is MoE and the other dense.
  All the launch benchmarks are at 16 bit.
randomNumber7 57 minutes ago
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)
[-]
- spott 13 minutes ago
  https://newsletter.maartengrootendorst.com/p/a-visual-guide-... (in a link from here: https://developers.googleblog.com/gemma-4-12b-the-developer-..., which was linked in the text of the post, but not the linkdump at the end).
nickandbro 1 hour ago
Wow Google is becoming the new pre Llama 4 Meta when it comes to releasing open weights models.
[-]
- embedding-shape 1 hour ago
  I dunno, feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under Apache 2.0 license) to compare them to a company that never done any FOSS releases, and mostly done proprietary "available to download" releases.
  [-]
  - seba_dos1 59 minutes ago
    Note that a binary released under Apache 2.0 license does not yet make it FOSS.
    [-]
    - embedding-shape 53 minutes ago
      Agreed, miles ahead though from "proprietary" which is what Meta been using for most model releases.
      Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...
- brianwawok 57 minutes ago
  Every other Google model I have tried felt very weak compared to qwen models. I dont have a ton of use case for multimodal though, so its very possible this is a fantastic multimodal model.
  [-]
  - wongarsu 35 minutes ago
    Gemma 4 27b and 32b feel pretty capable for text and visionn. Comparable with qwen, maybe a bit better on tool calling heavy tasks
    I am not overly impressed with the smaller gemma models. And gemma 3 was a bit of a mixed bag, great at some things, bad at most others
- redman25 1 hour ago
  IDK this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it but people suspect it was not released because it was too close to gemini flash in performance.
djyde 55 minutes ago
What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?
[-]
- philipkglass 6 minutes ago
  I have vLLM running on a Linux machine in my basement, connected with Tailscale, and I use small models as part of tasks like this:
  - Transcribing scanned documents into formatted text
  - Captioning/describing images and classifying them for audience suitability (includes anti-spam)
  - Matching documents with relevant Wikipedia pages for tagging
  I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.
  I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.
- mhitza 5 minutes ago
  In theory, locally you'd use these where lossiness is acceptable for audio transcription and image labeling (as simple examples).
  In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.
- properbrew 15 minutes ago
  I think small models have a very good niche for specific tasks. I utilise a fine tuned Phi-4 model (smaller than this one) that fits in about 3.5gb of RAM (not vram) for the document processing side of things for the desktop app I develop (a bit of a shameless plug - whistle-enterprise.com).
  If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.
- robgough 15 minutes ago
  I've got a home-built dictation app that uses a local model to clear up the text and fix grammar. It was super easy to build. I’m extending it to capture meeting notes and summarise too. All on-device.
  I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.
  There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.
- Aachen 37 minutes ago
  "Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just laying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway, you'd have to go for the top tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices so it'd probably still work out the same
- Xiol 38 minutes ago
  I've yet to see someone answer a question like this with a decent, useful answer.
BiraIgnacio 37 minutes ago
using an embedder instead of a decoder is quite clever. Not sure who came up with that first but it's a cool idea.
digdugdirk 43 minutes ago
I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?
[-]
- claysmithr 9 minutes ago
  I have 24 gb unified memory so it’s a good model for me
- vehemenz 24 minutes ago
  This comment has me a bit confused.
  Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.
  [-]
  - utternerd 3 minutes ago
    Unified Memory or VRAM, not just RAM.
  - NekkoDroid 14 minutes ago
    I think you are mixing up RAM and VRAM.
    [-]
    - crims0n 1 minute ago
      They are effectively one and the same on Apple Silicon.
    - Schiendelman 3 minutes ago
      On a Mac they are the same thing; they're shared. Of course you need some amount for the OS, but if you have an Apple Silicon Mac with 24GB of RAM, you can likely run a 16GB model.
claysmithr 38 minutes ago
I don’t see the download in lm studio
[-]
- deckar01 9 minutes ago
  It also says it is supposed to be available in their own Edge Gallery app and it’s not there (on iOS).
zuminator 1 hour ago
How does it compare with e4b, aside from being larger?
[-]
- anonova 49 minutes ago
  There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...
- thomasjb 56 minutes ago
  That's what I want to know too. A smarter E4B that's happy in opencode would be a good selfhosted model for me
jdelman 52 minutes ago
I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.