Cross-Device End-Side Generative AI Multi-Modal Benchmarking with Nexa Compressed Inference

AI News | Sharenet.ai

Executive Summary

The Nexa native inference framework makes deploying generative AI models on-device seamless and efficient. It supports a wide range of chipsets, including AMD, Qualcomm, Intel, NVIDIA, and custom in-house chips, and is compatible with all major operating systems. We provide benchmark data for generative AI models on a variety of common tasks, each tested on devices at different TOPS performance levels.

Core strengths:

  1. Multimodal capability - supports generative AI tasks across text, audio, video, and vision
  2. Broad hardware compatibility - runs AI models on PCs, laptops, mobile devices, and embedded systems
  3. Leading performance - with our NexaQuant edge inference framework, models run 2.5x faster with 4x lower storage and memory requirements, while maintaining high accuracy

Why end-side AI?

Deploying AI models directly on the device side has several advantages over relying on cloud APIs:

  • Privacy and security - data stays on the device, ensuring confidentiality
  • Lower cost - no fees for expensive cloud inference
  • Speed and responsiveness - low-latency inference with no dependence on the network
  • Offline capability - AI applications keep working in low-connectivity areas

With Nexa edge inference technology, developers can efficiently run generative AI models on a wide range of devices while minimizing resource consumption.

New Trends in Multimodal AI Applications

Nexa AI's on-device deployment supports multimodal AI, enabling applications to process and integrate multiple data types:

  • Text AI - Chatbots, document summarization, programming assistants
  • Speech to Speech AI - Real-time voice translation, AI voice assistant
  • Visual AI - Object detection, image captioning, document OCR processing

Using NexaQuant, our multimodal models achieve strong compression and acceleration while maintaining top-tier performance.

Cross-Device Generative AI Task Performance Benchmarks

We provide benchmark data for generative AI models on a variety of common tasks, each tested on devices at different TOPS performance levels. If you have a specific device and target use case, you can refer to a device with similar performance to estimate processing power:

Generative AI tasks covered:

  • Speech-to-speech
  • Text-to-text
  • Visual-to-text

Covered device types:

  • Modern laptop chips - optimized for native AI processing on desktops and laptops
  • Flagship mobile chips - AI models running on smartphones and tablets
  • Embedded systems (~4 TOPS) - low-power devices for edge computing applications
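To estimate processing power from a similar device, a rough first-order approach is to scale decode speed linearly by TOPS from a known reference point. The sketch below uses the ~4 TOPS embedded system's text-to-text result (~5.31 tokens/second, from the table later in this article) as its reference; linear scaling is an assumption, since real throughput also depends on memory bandwidth and quantization support:

```python
def estimate_tokens_per_second(target_tops, ref_tops=4.0, ref_tps=5.31):
    """Roughly estimate decode speed by linear TOPS scaling.

    Reference point (assumption): the ~4 TOPS embedded system decodes
    ~5.31 tokens/second on the text-to-text benchmark.
    """
    return ref_tps * (target_tops / ref_tops)

# A hypothetical 8 TOPS edge device would land near:
print(round(estimate_tokens_per_second(8.0), 2))  # 10.62 tokens/second
```

Treat the result as a ballpark figure only; measuring on the actual device is always more reliable.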

Speech-to-speech benchmarking

Evaluates real-time speech interaction with language models - audio input, audio output.

| Device Type | Chip & Device | Latency (TTFT) | Decode Speed | Avg. Peak Memory |
| --- | --- | --- | --- | --- |
| Modern laptop chip (GPU) | Apple M3 Pro GPU | 0.67 s | 20.46 tokens/s | ~990MB |
| Modern laptop chip (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 1.01 s | 19.28 tokens/s | ~990MB |
| Modern laptop chip (CPU) | Intel Core Ultra 7 268V | 1.89 s | 11.88 tokens/s | ~990MB |
| Flagship mobile chip (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 1.45 s | 9.13 tokens/s | ~990MB |
| Embedded IoT system (CPU) | Raspberry Pi 4 Model B | 6.9 s | 4.5 tokens/s | ~990MB |

Speech-to-Speech Benchmarking Using Moshi with NexaQuant
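The TTFT (time to first token) and decode-speed metrics reported in these tables can be reproduced with a small timing harness like the sketch below. `generate_stream` here is a hypothetical token iterator standing in for your model runner, not a Nexa API:

```python
import time

def benchmark_stream(generate_stream):
    """Measure TTFT and decode speed for any token-streaming iterable.

    Returns (ttft_seconds, decode_tokens_per_second).  Decode speed is
    computed over tokens produced after the first one, which is the
    usual convention for separating prefill from decode.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    decode_tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return ttft, decode_tps
```

Averaging over several runs, and discarding a warm-up run, gives more stable numbers than a single measurement.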

Text-to-text benchmarking

Evaluates the performance of AI models that generate text from text input - text input, text output.

| Device Type | Chip & Device | Initial Latency (TTFT) | Decode Speed | Avg. Peak Memory |
| --- | --- | --- | --- | --- |
| Modern laptop chip (GPU) | Apple M3 Pro GPU | 0.12 s | 49.01 tokens/s | ~2580MB |
| Modern laptop chip (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 0.19 s | 30.54 tokens/s | ~2580MB |
| Modern laptop chip (CPU) | Intel Core Ultra 7 268V | 0.63 s | 14.35 tokens/s | ~2580MB |
| Flagship mobile chip (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 0.27 s | 10.89 tokens/s | ~2580MB |
| Embedded IoT system (CPU) | Raspberry Pi 4 Model B | 1.27 s | 5.31 tokens/s | ~2580MB |

Text-to-text benchmarking using llama-3.2 with NexaQuant
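The "Avg. Peak Memory" column can be approximated for your own process with the OS-reported peak resident set size. A minimal sketch for Unix-like systems (the `resource` module is not available on Windows), assuming the model runs inside the measured process:

```python
import resource
import sys

def peak_memory_mb():
    """Peak resident set size of the current process, in MB.

    ru_maxrss is reported in kilobytes on Linux but in bytes on macOS,
    so normalize per platform.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor

print(f"peak memory: ~{peak_memory_mb():.0f}MB")
```

Call it after inference finishes to capture the high-water mark, including model weights and the KV cache.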

Visual-to-text benchmarking

Evaluates the AI's ability to analyze visual inputs to generate responses, extract key visual information, and drive dynamic guidance tools - visual input, text output.

| Device Type | Chip & Device | Initial Latency (TTFT) | Decode Speed | Avg. Peak Memory |
| --- | --- | --- | --- | --- |
| Modern laptop chip (GPU) | Apple M3 Pro GPU | 2.62 s | 86.77 tokens/s | ~1093MB |
| Modern laptop chip (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 2.14 s | 83.41 tokens/s | ~1093MB |
| Modern laptop chip (CPU) | Intel Core Ultra 7 268V | 9.43 s | 45.65 tokens/s | ~1093MB |
| Flagship mobile chip (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 7.26 s | 27.66 tokens/s | ~1093MB |
| Embedded IoT system (CPU) | Raspberry Pi 4 Model B | 22.32 s | 6.15 tokens/s | ~1093MB |

Visual-to-Text Benchmarking Using OmniVLM with NexaQuant
