Cache AI models in the browser

Most AI models have at least one thing in common: they are fairly large for a resource that's transferred over the Internet. The smallest MediaPipe object detection model (SSD MobileNetV2 float16) weighs 5.6 MB and the largest is around 25 MB.

The open-source LLM gemma-2b-it-gpu-int4.bin clocks in at 1.35 GB—and this is considered very small for an LLM. Generative AI models can be enormous. This is why a lot of AI usage today happens in the cloud. Increasingly, apps are running highly optimized models directly on-device. While demos of LLMs running in the browser exist, here are some production-grade examples of other models running in the browser:

Adobe Photoshop on the web with the AI-powered object selection tool open, with three objects selected: two giraffes and a moon.

To make future launches of your applications faster, you should explicitly cache the model data on-device, rather than relying on the implicit HTTP browser cache.

While this guide uses the gemma-2b-it-gpu-int4.bin model to create a chatbot, the approach can be generalized to suit other models and other use cases on-device. The most common way to connect an app to a model is to serve the model alongside the rest of the app resources. It's crucial to optimize the delivery.

Configure the right cache headers

If you serve AI models from your server, it's important to configure the correct Cache-Control header. The following example shows a solid default setting, which you can build on for your app's needs.

Cache-Control: public, max-age=31536000, immutable

Each released version of an AI model is a static resource. Content that never changes should be given a long max-age combined with cache busting in the request URL. If you do need to update the model, you must give it a new URL.

When the user reloads the page, the client sends a revalidation request, even though the server knows that the content is stable. The immutable directive explicitly indicates that revalidation is unnecessary, because the content won't change. The immutable directive is not widely supported by browsers and intermediary cache or proxy servers, but by combining it with the universally understood max-age directive, you can ensure maximum compatibility. The public response directive indicates that the response can be stored in a shared cache.

Chrome DevTools displays the production Cache-Control headers sent by Hugging Face when requesting an AI model. (Source)

Cache AI models client-side

When you serve an AI model, it's important to explicitly cache the model in the browser. This ensures the model data is readily available after a user reloads the app.

There are a number of techniques you can use to achieve this. For the following code samples, assume each model file is stored in a Blob object named blob in memory.

To understand the performance, each code sample is annotated with the performance.mark() and the performance.measure() methods. These measures are device-dependent and not generalizable.

In Chrome DevTools Application > Storage, review the usage diagram with segments for IndexedDB, Cache storage, and File System. Each segment is shown to consume 1354 megabytes of data, which totals to 4063 megabytes.

You can choose to use one of following APIs to cache AI models in the browser: Cache API, the Origin Private File System API, and IndexedDB API. The general recommendation is to use the Cache API, but this guide discusses the advantages and disadvantages of all options.

Cache API

The Cache API provides persistent storage for Request and Response object pairs that are cached in long-lived memory. Although it's defined in the Service Workers spec, you can use this API from the main thread or a regular worker. To use it outside a service worker context, call the Cache.put() method with a synthetic Response object, paired with a synthetic URL instead of a Request object.

This guide assumes an in-memory blob. Use a fake URL as the cache key and a synthetic Response based on the blob. If you would directly download the model, you would use the Response you would get from making a fetch() request.

For example, here's how to store and restore a model file with the Cache API.

const storeFileInSWCache = async (blob) => {
  try {
    performance.mark('start-sw-cache-cache');
    const modelCache = await caches.open('models');
    await modelCache.put('model.bin', new Response(blob));
    performance.mark('end-sw-cache-cache');

    const mark = performance.measure(
      'sw-cache-cache',
      'start-sw-cache-cache',
      'end-sw-cache-cache'
    );
    console.log('Model file cached in sw-cache.', mark.name, mark.duration.toFixed(2));
  } catch (err) {
    console.error(err.name, err.message);
  }
};

const restoreFileFromSWCache = async () => {
  try {
    performance.mark('start-sw-cache-restore');
    const modelCache = await caches.open('models');
    const response = await modelCache.match('model.bin');
    if (!response) {
      throw new Error(`File model.bin not found in sw-cache.`);
    }
    const file = await response.blob();
    performance.mark('end-sw-cache-restore');
    const mark = performance.measure(
      'sw-cache-restore',
      'start-sw-cache-restore',
      'end-sw-cache-restore'
    );
    console.log(mark.name, mark.duration.toFixed(2));
    console.log('Cached model file found in sw-cache.');
    return file;
  } catch (err) {    
    throw err;
  }
};

Origin Private File System API

The Origin Private File System (OPFS) is a comparatively young standard for a storage endpoint. It's private to the origin of the page, and thus is invisible to the user, unlike the regular file system. It provides access to a special file that is highly optimized for performance and offers write access to its content.

For example, here's how to store and restore a model file in the OPFS.

const storeFileInOPFS = async (blob) => {
  try {
    performance.mark('start-opfs-cache');
    const root = await navigator.storage.getDirectory();
    const handle = await root.getFileHandle('model.bin', { create: true });
    const writable = await handle.createWritable();
    await blob.stream().pipeTo(writable);
    performance.mark('end-opfs-cache');
    const mark = performance.measure(
      'opfs-cache',
      'start-opfs-cache',
      'end-opfs-cache'
    );
    console.log('Model file cached in OPFS.', mark.name, mark.duration.toFixed(2));
  } catch (err) {
    console.error(err.name, err.message);
  }
};

const restoreFileFromOPFS = async () => {
  try {
    performance.mark('start-opfs-restore');
    const root = await navigator.storage.getDirectory();
    const handle = await root.getFileHandle('model.bin');
    const file = await handle.getFile();
    performance.mark('end-opfs-restore');
    const mark = performance.measure(
      'opfs-restore',
      'start-opfs-restore',
      'end-opfs-restore'
    );
    console.log('Cached model file found in OPFS.', mark.name, mark.duration.toFixed(2));
    return file;
  } catch (err) {    
    throw err;
  }
};

IndexedDB API

IndexedDB is a well-established standard for storing arbitrary data in a persistent manner in the browser. It's infamously known for its somewhat complex API, but by using a wrapper library lsuch as idb-keyval you can treat IndexedDB like a classic key-value store.

For example:

import { get, set } from 'https://cdn.jsdelivr.net/npm/idb-keyval@latest/+esm';

const storeFileInIDB = async (blob) => {
  try {
    performance.mark('start-idb-cache');
    await set('model.bin', blob);
    performance.mark('end-idb-cache');
    const mark = performance.measure(
      'idb-cache',
      'start-idb-cache',
      'end-idb-cache'
    );
    console.log('Model file cached in IDB.', mark.name, mark.duration.toFixed(2));
  } catch (err) {
    console.error(err.name, err.message);
  }
};

const restoreFileFromIDB = async () => {
  try {
    performance.mark('start-idb-restore');
    const file = await get('model.bin');
    if (!file) {
      throw new Error('File model.bin not found in IDB.');
    }
    performance.mark('end-idb-restore');
    const mark = performance.measure(
      'idb-restore',
      'start-idb-restore',
      'end-idb-restore'
    );
    console.log('Cached model file found in IDB.', mark.name, mark.duration.toFixed(2));
    return file;
  } catch (err) {    
    throw err;
  }
};

Mark storage as persisted

Call navigator.storage.persist() at the end of any of these caching methods to request permission to use persistent storage. This method returns a promise that resolves to true if permission is granted, and false otherwise. The browser may or may not honor the request, depending on browser-specific rules.

if ('storage' in navigator && 'persist' in navigator.storage) {
  try {
    const persistent = await navigator.storage.persist();
    if (persistent) {
      console.log("Storage will not be cleared except by explicit user action.");
      return;
    }
    console.log("Storage may be cleared under storage pressure.");  
  } catch (err) {
    console.error(err.name, err.message);
  }
}

Special case: Use a model on a hard disk

You can reference AI models directly from a user's hard disk as an alternative to browser storage. This technique can help research-focused apps showcase the feasibility of running given models in the browser, or allow artists to use self-trained models in expert creativity apps.

File System Access API

With the File System Access API, you can open files from the hard disk and obtain a FileSystemFileHandle that you can persist to IndexedDB.

With this pattern, the user only needs to grant access to the model file once. Thanks to persisted permissions, the user can choose to permanently grant access to the file. After reloading the app and a required user gesture, such as a mouse click, the FileSystemFileHandle can be restored from IndexedDB with access to the file on the hard disk.

The file access permissions are queried and requested if necessary, which makes this seamless for future reloads. The following example shows how to get a handle for a file from the hard disk, and then store and restore the handle.

import { fileOpen } from 'https://cdn.jsdelivr.net/npm/browser-fs-access@latest/dist/index.modern.js';
import { get, set } from 'https://cdn.jsdelivr.net/npm/idb-keyval@latest/+esm';

button.addEventListener('click', async () => {
  try {
    const file = await fileOpen({
      extensions: ['.bin'],
      mimeTypes: ['application/octet-stream'],
      description: 'AI model files',
    });
    if (file.handle) {
      // It's an asynchronous method, but no need to await it.
      storeFileHandleInIDB(file.handle);
    }
    return file;
  } catch (err) {
    if (err.name !== 'AbortError') {
      console.error(err.name, err.message);
    }
  }
});

const storeFileHandleInIDB = async (handle) => {
  try {
    performance.mark('start-file-handle-cache');
    await set('model.bin.handle', handle);
    performance.mark('end-file-handle-cache');
    const mark = performance.measure(
      'file-handle-cache',
      'start-file-handle-cache',
      'end-file-handle-cache'
    );
    console.log('Model file handle cached in IDB.', mark.name, mark.duration.toFixed(2));
  } catch (err) {
    console.error(err.name, err.message);
  }
};

const restoreFileFromFileHandle = async () => {
  try {
    performance.mark('start-file-handle-restore');
    const handle = await get('model.bin.handle');
    if (!handle) {
      throw new Error('File handle model.bin.handle not found in IDB.');
    }
    if ((await handle.queryPermission()) !== 'granted') {
      const decision = await handle.requestPermission();
      if (decision === 'denied' || decision === 'prompt') {
        throw new Error(Access to file model.bin.handle not granted.');
      }
    }
    const file = await handle.getFile();
    performance.mark('end-file-handle-restore');
    const mark = performance.measure(
      'file-handle-restore',
      'start-file-handle-restore',
      'end-file-handle-restore'
    );
    console.log('Cached model file handle found in IDB.', mark.name, mark.duration.toFixed(2));
    return file;
  } catch (err) {    
    throw err;
  }
};

These methods aren't mutually exclusive. There may be a case where you both explicitly cache a model in the browser and use a model from a user's hard disk.

Demo

You can see all three regular case storage methods and the hard disk method implemented in the MediaPipe LLM demo.

Bonus: Download a large file in chunks

If you need to download a large AI model from the Internet, parallelize the download into separate chunks, then stitch together again on the client.

Here's a helper function that you can use in your code. You only need to pass it the url. The chunkSize (default: 5MB), the maxParallelRequests (default: 6), the progressCallback function (which reports on the downloadedBytes and the total fileSize), and the signal for an AbortSignal signal are all optional.

You can copy the following function in your project or install the fetch-in-chunks package from npm package.

async function fetchInChunks(
  url,
  chunkSize = 5 * 1024 * 1024,
  maxParallelRequests = 6,
  progressCallback = null,
  signal = null
) {
  // Helper function to get the size of the remote file using a HEAD request
  async function getFileSize(url, signal) {
    const response = await fetch(url, { method: 'HEAD', signal });
    if (!response.ok) {
      throw new Error('Failed to fetch the file size');
    }
    const contentLength = response.headers.get('content-length');
    if (!contentLength) {
      throw new Error('Content-Length header is missing');
    }
    return parseInt(contentLength, 10);
  }

  // Helper function to fetch a chunk of the file
  async function fetchChunk(url, start, end, signal) {
    const response = await fetch(url, {
      headers: { Range: `bytes=${start}-${end}` },
      signal,
    });
    if (!response.ok && response.status !== 206) {
      throw new Error('Failed to fetch chunk');
    }
    return await response.arrayBuffer();
  }

  // Helper function to download chunks with parallelism
  async function downloadChunks(
    url,
    fileSize,
    chunkSize,
    maxParallelRequests,
    progressCallback,
    signal
  ) {
    let chunks = [];
    let queue = [];
    let start = 0;
    let downloadedBytes = 0;

    // Function to process the queue
    async function processQueue() {
      while (start < fileSize) {
        if (queue.length < maxParallelRequests) {
          let end = Math.min(start + chunkSize - 1, fileSize - 1);
          let promise = fetchChunk(url, start, end, signal)
            .then((chunk) => {
              chunks.push({ start, chunk });
              downloadedBytes += chunk.byteLength;

              // Update progress if callback is provided
              if (progressCallback) {
                progressCallback(downloadedBytes, fileSize);
              }

              // Remove this promise from the queue when it resolves
              queue = queue.filter((p) => p !== promise);
            })
            .catch((err) => {              
              throw err;              
            });
          queue.push(promise);
          start += chunkSize;
        }
        // Wait for at least one promise to resolve before continuing
        if (queue.length >= maxParallelRequests) {
          await Promise.race(queue);
        }
      }

      // Wait for all remaining promises to resolve
      await Promise.all(queue);
    }

    await processQueue();

    return chunks.sort((a, b) => a.start - b.start).map((chunk) => chunk.chunk);
  }

  // Get the file size
  const fileSize = await getFileSize(url, signal);

  // Download the file in chunks
  const chunks = await downloadChunks(
    url,
    fileSize,
    chunkSize,
    maxParallelRequests,
    progressCallback,
    signal
  );

  // Stitch the chunks together
  const blob = new Blob(chunks);

  return blob;
}

export default fetchInChunks;

Choose the right method for you

This guide has explored various methods for effectively caching AI models in the browser, a task that's crucial for enhancing the user's experience with and the performance of your app. The Chrome storage team recommends the Cache API for optimal performance, to ensure quick access to AI models, reducing load times and improving responsiveness.

The OPFS and IndexedDB are less usable options. The OPFS and the IndexedDB APIs need to serialize the data before it can be stored. IndexedDB also needs to deserialize the data when it's retrieved, making it the worst place to store large models.

For niche applications, the File System Access API offers direct access to files on a user's device, ideal for users who manage their own AI models.

If you need to secure your AI model, keep it on the server. Once stored on the client, it's trivial to extract the data from both the Cache and IndexedDB with DevTools or the OFPS DevTools extension. These storage APIs are inherently equal in security. You might be tempted to store an encrypted version of the model, but you then need to get the decryption key to the client, which could be intercepted. This means a bad actor's attempt to steal steal your model is slightly harder, but not impossible.

We encourage you to choose a caching strategy that aligns with your app's requirements, target audience behavior, and the characteristics of the AI models used. This ensures your applications are responsive and robust under various network conditions and system constraints.


Acknowledgements

This was reviewed by Joshua Bell, Reilly Grant, Evan Stade, Nathan Memmott, Austin Sullivan, Etienne Noël, André Bandarra, Alexandra Klepper, François Beaufort, Paul Kinlan, and Rachel Andrew.