Local LLMs on Mobile Are a Gimmick—For Now

Kewin Wereszczyński

In short

This article explores the current state of running local LLMs on mobile, breaking down the challenges of model size, inference efficiency, and hardware limitations. By the end, you'll understand the trade-offs, available frameworks, and emerging optimizations—helping you decide whether on-device AI inference is worth exploring for your use case.

With LLMs becoming more optimized and more performant for local inference on mobile devices with each passing day, it’s easy to imagine the moment when, instead of relying on API calls, we’ll run AI models locally on our devices. In this article, we’ll explore the current ups and downs of that approach (this may change quickly). To provide more context, we’ll focus on using LLMs in a development environment, not just installing an app that lets us chat.

Issues with running AI models locally on mobile

I’ll start with the elephant in the room and cut through the hype. Why aren’t local LLMs production-ready on mobile yet, despite improvements from models like Llama and Mistral, which aim to run efficiently on consumer hardware? Please keep in mind that I do believe these issues will, in fact, be resolved in the future, and they shouldn’t stop you from learning about the topic. First things first: what actually is a “model”?

In simplest terms, it’s a file containing a bunch of multidimensional arrays of numbers. A model consists of two key components: the architecture (the blueprint) and the weights (the learned numerical parameters). A useful analogy is a class declaration (the architecture) and an object instance snapshot (the weights).
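
To make the class-vs-instance analogy concrete, here is a tiny, purely illustrative sketch in Python using PyTorch; the layer names and sizes are made up and have nothing to do with any real LLM:

```python
import torch
import torch.nn as nn

# The "architecture" is the class declaration: it defines the shape of the
# computation but contains no learned knowledge yet.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(128, 256)
        self.layer2 = nn.Linear(256, 64)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

# The "weights" are the learned numerical parameters: a snapshot of one trained
# instance, which is what the multi-gigabyte model file actually stores.
model = TinyModel()
weights = model.state_dict()          # dict of tensors, e.g. "layer1.weight"
torch.save(weights, "tiny_model.pt")  # the saved file is the analogue of that big download
```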

You might now wonder how big that file can be. Well, it depends on the model. For clarity, let’s focus on DeepSeek-R1-Distill-Qwen-1.5B. This is the smallest DeepSeek model I found, and it still weighs 3.5 GB! Given its capabilities (which keep improving) relative to its size, that weight is justified. Still, this takes us to the first issue with running AI locally: the model size.
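
As a back-of-the-envelope check (assuming 16-bit weights and ignoring metadata and overhead), a 3.5 GB file implies roughly 1.75 billion parameters, and quantizing those same weights to 4 bits would shrink them to about a quarter of that:

```python
# Rough, illustrative arithmetic only: work backwards from the quoted file size.
file_size_gb = 3.5
bytes_per_fp16_weight = 2
approx_params = file_size_gb * 1e9 / bytes_per_fp16_weight   # ~1.75e9 weights

# The same weights stored at 4 bits (0.5 bytes each), before any overhead:
approx_4bit_gb = approx_params * 0.5 / 1e9

print(f"~{approx_params / 1e9:.2f} billion parameters at FP16")
print(f"~{approx_4bit_gb:.1f} GB after 4-bit quantization")
```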

Model size

Models are big. That’s a fact. So the first question we need to answer is, “How do we even deliver the model to the device?” Three options come to mind:

  • Bundling it with the app (iOS App Store limits apps to 4 GB, but side-loading allows larger sizes).
  • Downloading the model after the app is installed (requires a source; see the sketch after this list).
  • Asking the user to provide the model (requires a tech-savvy user).
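
In a real app, the second option would use the platform’s own networking and storage APIs, but the idea is simple: fetch the weights on first launch instead of shipping them in the bundle. Here is a hedged desktop-side sketch using the huggingface_hub package; the repository ID and filename are just examples:

```python
from huggingface_hub import hf_hub_download

# Download the model file once and cache it locally, instead of bundling it
# inside the app. Repo and filename below are illustrative placeholders.
model_path = hf_hub_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    filename="model.safetensors",
)
print(f"Model cached at: {model_path}")
```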

Since the model we picked fits within the iOS limits, let’s bundle it, even though side-loading is the best option. It’s also worth mentioning that techniques like quantization can reduce the model size, improving LLM efficiency on mobile devices. Now, how do we actually run it?

Model formats

AI models come in many different formats. Just like images have PNGs, JPGs, and GIFs, LLMs come in various formats, like Safetensors (efficient storage), GGUF (optimized for CPU-based inference), and ONNX (cross-framework compatibility). So, before passing the model to our mobile device for inference, we need to ensure that it’s properly processed and prepared. This topic is big enough for its own deep dive. Our DeepSeek model is in the Safetensors format, so what are our options? It depends on what acceleration we will use for running AI locally on the device.
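
To get a feel for what a Safetensors file actually contains, here is a small sketch using the safetensors Python package (the file path is a placeholder); it lists a few tensors without loading the whole file into memory:

```python
from safetensors import safe_open

# Open the checkpoint lazily: tensors are read on demand, not all at once.
with safe_open("model.safetensors", framework="pt") as f:
    for name in list(f.keys())[:5]:       # peek at the first few entries
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```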

Model inference

Even after limiting ourselves to iOS and Android, we still have many options. We can use ExecuTorch, LiteRT (formerly TensorFlow Lite), ONNX Runtime, or Core ML (iOS exclusive). Why do we even need some external framework for running AI locally?

Firstly, we need a tool to read the model file and create virtual layers from the blueprint. Then, we need to efficiently load up the weights (it’s like initializing the class object). After the framework helps us with that, it also needs to provide low-level access to the runtime environment to fully use GPU capabilities and possible acceleration. As above, this topic is big enough for a separate deep dive.
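
As one concrete illustration of “read the blueprint, load the weights, and hand everything to an accelerated runtime”, here is a hedged ONNX Runtime sketch; the model path, input shape, and dtype are placeholders, and the CoreML provider only applies on Apple hardware (the runtime falls back to CPU elsewhere):

```python
import numpy as np
import onnxruntime as ort

# The runtime parses the graph (the architecture), loads the weights, and picks
# the best available execution provider for hardware acceleration.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to a converted model
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)

# Placeholder input; a real model defines its own input names, shapes, and dtypes.
input_name = session.get_inputs()[0].name
dummy_input = np.zeros((1, 128), dtype=np.int64)
outputs = session.run(None, {input_name: dummy_input})
print(f"Got {len(outputs)} output tensor(s)")
```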

But let’s say we managed to pick a framework, convert the model, and load it up. How does it work?

Performance trade-offs: Speed vs. quality

Inference on mobile devices needs to balance a couple of factors to reach a satisfactory end result. If we get the response fast, the quality might suffer. On the other hand, a better-quality response might take time. Remember that the response quality of a smaller AI model on mobile is nowhere near the quality of models from external providers. An additional issue is device heating and battery consumption. Even when the AI model runs on-device, it still requires huge amounts of resources.
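
To make the trade-off tangible, here is a quick, purely illustrative calculation; the throughput numbers below are assumptions for the sake of the example, not benchmarks:

```python
# Illustrative numbers only: real throughput depends on the device, the model,
# the quantization level, and the inference framework.
answer_tokens = 300          # length of a typical medium-sized answer
cloud_tokens_per_s = 60      # assumed throughput of a hosted API
device_tokens_per_s = 10     # assumed throughput of a small on-device model

print(f"Cloud API:  ~{answer_tokens / cloud_tokens_per_s:.0f} s per answer")
print(f"On-device:  ~{answer_tokens / device_tokens_per_s:.0f} s per answer")
```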

Let’s try running the model on mobile anyway!

After deciding on a model, I searched for the perfect framework to use. I found MLC (Machine Learning Compilation). Why is it a great choice for running AI locally? It’s basically like dockerization, but for ML models! It simplifies many steps and spares you from having to think about each model’s architecture and format.

MLC allows you to compile your model before deploying it to the device. It works out all of the operations required to run the model, optimizes them, and attaches everything needed to execute them. Once that’s done, your device just needs to run those functions!

There are two ways to achieve that; let’s start with the harder one:

  1. Install mlc_llm (but install it from source!).
  2. Find a model that interests you on Hugging Face.
  3. Clone that model’s repository with Git.
  4. Run the convert_weight and gen_config commands described here.
  5. Once that’s done, package the model as described here.

This way, you will get a nicely packaged LLM model ready to run on mobile. Alternatively, you can find plenty of ready-to-go models that have already been compiled on Hugging Face, like this one.
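
As a quick sanity check before moving to the phone, a packaged MLC model can be exercised from Python on a desktop. Below is a minimal sketch, assuming the mlc_llm package is installed; the model ID is the example from MLC’s own documentation, so substitute the repository you compiled or picked:

```python
from mlc_llm import MLCEngine

# An already-compiled MLC model hosted on Hugging Face (example ID; swap in your own).
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style chat completion, streamed token by token.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello from a local model."}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

print()
engine.terminate()
```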

Now, all we need is a way to run that model inside a mobile app. We went ahead and prepared a nice project to help you with that. You can find it here.

Are local LLMs on mobile worth it?

We have already started exploring this topic with our open-source project, and we can confirm that it’s a fun and fast-moving area. So, after seeing all of those issues, why should we care about running LLMs on mobile devices? In short: it’s the future.

Currently, this technology is the worst it will ever be, and it’s pretty impressive already. We’re already seeing optimizations like distillation, quantization, and MoE (Mixture of Experts) architectures making local models more viable, and devices will get powerful enough so that everyone will have their own Jarvis in their pocket.

Another important point is that even if you don’t plan on working with LLMs, the same workflow will allow you to run any AI model on a mobile device. Object detection, face detection, language correction, or AR directions are all AI models in disguise.

Latest update:
March 11, 2025
