Applications of the CLIP Model on Mobile Devices

Lulin


0. CLIP

CLIP (Contrastive Language-Image Pre-Training) is a multimodal model released by OpenAI in 2021 for matching images with text.

Given an image and a set of candidate text descriptions, the model predicts which description is most relevant to the image.

Calling the CLIP model from PyTorch:

import torch
import clip
from PIL import Image

device = "mps"  # Apple-silicon GPU backend; use "cuda" or "cpu" on other machines
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("img.jpg")).unsqueeze(0).to(device)

text = clip.tokenize(["a diagram", "a dog", "a cat", "a white cat", "oreo"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)

# Label probs: one softmax probability per prompt; the description with the highest score is the best match

1. CoreML

CoreML is the machine learning framework Apple introduced with iOS 11 for integrating trained models into an app. CoreML supports a wide range of model types, including neural networks, tree ensembles, support vector machines, and generalized linear models.

2. Converting CLIP to CoreML

With the coremltools package, the CLIP model can be converted to the CoreML model format. The example below converts the text encoder:

import torch
import coremltools as ct
import clip
import numpy as np
from PIL import Image
from transformers import CLIPTextModelWithProjection, CLIPTokenizerFast

model_id = "openai/clip-vit-base-patch32"
model = CLIPTextModelWithProjection.from_pretrained(model_id, return_dict=False)
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
model.eval()

example_input = tokenizer("a photo of a cat", return_tensors="pt")
example_input = example_input.data['input_ids']

traced_model = torch.jit.trace(model, example_input)

max_seq_length = 76  # with the original length of 77 the validation below fails; 76 works fine with the app
text_encoder_model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
    inputs=[ct.TensorType(name="prompt",
                          shape=[1, max_seq_length],
                          dtype=np.int32)],
    outputs=[ct.TensorType(name="embOutput", dtype=np.float32),
             ct.TensorType(name="embOutput2", dtype=np.float32)],
)
text_encoder_model.save("TextEncoder_float32_test.mlpackage")

model = ct.models.MLModel('TextEncoder_float32_test.mlpackage')

# Choose a tokenizer; here we use the clip tokenizer
text = clip.tokenize("a photo of a cat")
text = text[:, :max_seq_length]

# # Or use CLIPTokenizerFast
# text = tokenizer("a photo of a cat", return_tensors="pt", padding="max_length", max_length=max_seq_length)
# text = text.data['input_ids'].to(torch.int32)

predictions = model.predict({'prompt': text})
out = traced_model(text)

print("PyTorch TextEncoder ckpt out for \"a photo of a cat\":\n>>>", out[0][0, :10])
print("\nCoreML TextEncoder ckpt out for \"a photo of a cat\":\n>>>", predictions['embOutput'][0, :10])

3. Image search app

Export the CLIP model as two separate models, an Image Encoder and a Text Encoder. On the app's first launch, load the Image Encoder, compute an embedding for every photo in the photo library, and store the vectors in a database or a local file cache. When the user enters a query, compute its embedding with the Text Encoder, walk through the cached image vectors, compute the cosine similarity between the text vector and each image vector, keep the top k results (the k largest values among the n scores), and return the corresponding photos, as sketched below.
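
In the app itself this step would be written in Swift, but the logic is simple enough to prototype in Python. The sketch below is only an illustration (the helper name top_k_images is hypothetical): it normalizes the embeddings so that a dot product equals cosine similarity, then returns the indices of the k best-matching images; ViT-B/32 produces 512-dimensional embeddings.

import numpy as np

def top_k_images(text_embedding, image_embeddings, k=5):
    """Return indices of the k cached images most similar to the text embedding.

    text_embedding:   (d,)   vector from the Text Encoder
    image_embeddings: (n, d) matrix cached after the first launch
    """
    # Normalize both sides so that a dot product equals cosine similarity.
    text_vec = text_embedding / np.linalg.norm(text_embedding)
    image_mat = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    similarities = image_mat @ text_vec            # shape (n,)
    # Sort descending and keep the k best indices.
    return np.argsort(-similarities)[:k]


# Example: 1000 cached 512-dim image vectors, one 512-dim text vector.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)
text_embedding = rng.normal(size=512).astype(np.float32)
print(top_k_images(text_embedding, image_embeddings, k=10))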

  • Title: Applications of the CLIP Model on Mobile Devices
  • Author: Lulin
  • Created at: 2023-10-31 13:38:19
  • Updated at: 2024-05-13 03:49:29
  • Link: https://blog.lllin.top/2023/10/31/clip/
  • License: This work is licensed under CC BY-NC-SA 4.0.