CHANGELOG

v1.3.0

Added support for Qwen3.5, Gemma4, and SmolLM3 models.
Optimized the multimodal input interface and cache reuse strategy.
Added support for multiple EOS token IDs and introduced the ignore_eos_token parameter.
Optimized performance on 32-bit systems.
Added support for tokenizer and embedding callbacks.
Improved long-context decoding performance for certain models on the RK3576 platform.
Optimized the quantization method for embedding input data.
Fixed memory usage statistics issues on the RV1126B platform.
Fixed numerical overflow issues during inference for certain models on the RK3588 platform.
Improved rkllm_server_demo compatibility with OpenAI API interfaces.
Added support for overriding max_new_tokens and sampling parameters in RKLLMInferParam.

Added support for RWKV7, Qwen3, and MiniCPM4 models
Added support for the RV1126B platform
Enabled function calling capability
Enabled cross-attention inference
Optimized the callback function to support pausing inference
Added support for multi-batch inference
Optimized the KV cache clearing interface
Improved chat template parsing with support for thinking mode selection
Updated the server demo to support the OpenAI-compatible format
Added reporting of model inference performance statistics
Added support for mrope multimodal position encoding
Added a new quantization optimization algorithm to improve quantization accuracy

Supports custom model conversion.
Supports chat_template configuration.
Enables multi-turn dialogue interactions.
Implements automatic prompt cache reuse for improved inference efficiency.
Expands maximum context length to 16K.
Supports embedding flash storage to reduce memory usage.
Introduces the GRQ Int4 quantization algorithm.
Supports GPTQ-Int8 model conversion.
Adds compatibility with the RK3562 platform.
Added support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
Supports CPU core configuration.
Added support for Gemma3.
Added support for Python 3.9/3.11/3.12.

Support group-wise quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
Support joint inference with LoRA model loading.
Support storage and preloading of prompt cache.
Support gguf model conversion (currently only q4_0 and fp16 are supported).
Optimize initialization, prefill, and decode time.
Support four input types: prompt, embedding, token, and multimodal.
Add PC-based simulation accuracy testing and inference interface support for rkllm-toolkit.
Add the gdq algorithm to improve 4-bit quantization accuracy.
Add a mixed quantization algorithm that combines grouped and non-grouped quantization according to specified ratios.
Add support for models such as Llama3, Gemma2, and MiniCPM3.
Resolve the catastrophic forgetting issue that occurs when the number of tokens exceeds max_context.