ISTA
- Vienna, Austria
- https://blog.panferov.org/
- in/blacksamorez
Stars
Code for "Tying the Loop - Tied Expert Layers in Mixture-of-Experts Language Models"
Python package for LLM compression
An iOS app that integrates a Large Language Model (LLM) to process audio recordings for transcription and summarization.
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Autonomous coding agent as an SDK, IDE extension, or CLI assistant.
Code for the EMNLP 2024 paper "Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on LLMs".
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Ext…
Friends don't let friends make certain types of data visualization - What are they and why are they bad.
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
Meditron is a suite of open-source medical Large Language Models (LLMs).
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference - EMNLP 2024
💎 A site that contains a systematic review of optimization methods and theory




