⚡️ Speed up method `OutlinesExLlamaV2Tokenizer.decode` by 43% by codeflash-ai[bot] · Pull Request #32 · HeshamHM28/outlines

codeflash-ai · 2025-06-11T22:10:02Z

📄 43% (0.43x) speedup for `OutlinesExLlamaV2Tokenizer.decode` in `outlines/models/exllamav2.py`

⏱️ Runtime : 862 microseconds → 601 microseconds (best of 364 runs)
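
As a quick sanity check, the headline figure is consistent with measuring speedup relative to the new runtime, that is, (old - new) / new. The helper name `relative_speedup` below is illustrative and is not part of Codeflash or this PR.

```python
def relative_speedup(old_us: float, new_us: float) -> float:
    """Speedup as a fraction of the new runtime: (old - new) / new."""
    if new_us <= 0:
        raise ValueError("new runtime must be positive")
    return (old_us - new_us) / new_us


# Runtimes reported above, in microseconds.
speedup = relative_speedup(862, 601)
print(f"{speedup:.0%} ({speedup:.2f}x)")  # prints "43% (0.43x)"
```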

📝 Explanation and details

Here is a faster version of your code, specifically the `decode` function. The main bottleneck is that `token_ids` is always converted into a new tensor via `torch.tensor(token_ids)`, even when it is already a `torch.Tensor`. The optimized version checks with `isinstance(token_ids, torch.Tensor)` and converts only when needed, which avoids the extra memory allocation and data copy. The gain is largest for tensor input: in the generated tests, `test_decode_tensor_input` drops from 37.3μs to 7.14μs. Long list inputs also improve, for example from 149μs to 105μs for a 999-token list.

The unnecessary `torch.LongTensor` import is also removed, and the remaining imports are cleaned up.

Here’s the optimized code.

Explanation of changes:

- Avoids the redundant `torch.tensor(token_ids)` call by using `torch.as_tensor` only if conversion is needed. This prevents unnecessary copies.
- Cleans up imports.
- All return values remain identical to before.

This should result in a large speed-up for the decode function, especially for already-tensor input, and will use less memory.
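
The optimized code itself is not reproduced on this page, so the following is only a minimal sketch of the pattern described above, not the PR's actual diff. The helper name `to_token_tensor` is hypothetical; it assumes the input is either a `torch.Tensor` or something `torch.as_tensor` can convert, such as a list of ints.

```python
import torch


def to_token_tensor(token_ids):
    """Return token_ids as a tensor, converting only when it is not one already.

    Hypothetical helper for illustration; not taken from outlines or this PR.
    """
    if isinstance(token_ids, torch.Tensor):
        # Already a tensor: hand it back unchanged, with no new allocation.
        return token_ids
    # Lists and other sequences are converted exactly once.
    return torch.as_tensor(token_ids)


ids = torch.tensor([0, 1, 2])
same = to_token_tensor(ids)            # the very same object, no copy
converted = to_token_tensor([0, 1, 2])  # a new tensor built from the list

print(same is ids)
print(converted.tolist())
```

By contrast, calling `torch.tensor(...)` on an existing tensor always produces a copy, which is the cost the explanation above attributes to the original `decode`.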

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 36 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 75.0% |

🌀 Generated Regression Tests Details

from typing import List

# imports
import pytest  # used for our unit tests
import torch
from outlines.models.exllamav2 import OutlinesExLlamaV2Tokenizer


# Mock tokenizer to simulate exl2_tokenizer behavior for testing
class MockTokenizer:
    def __init__(self):
        # Simulate a vocabulary and special tokens
        self._piece_to_id = {'hello': 0, 'world': 1, '<eos>': 2, '<pad>': 3, 'foo': 4, 'bar': 5}
        self.extended_piece_to_id = {'<eos>': 2, '<pad>': 3}
        self.eos_token_id = 2

    def get_piece_to_id_dict(self):
        return self._piece_to_id

    def decode(self, token_ids, decode_special_tokens=False):
        # Simulate decoding: map token ids to strings using the vocabulary
        # If token_ids is a 1D tensor, return a string
        # If token_ids is a 2D tensor, return a list of strings
        reverse_vocab = {v: k for k, v in self._piece_to_id.items()}

        if not isinstance(token_ids, torch.Tensor):
            token_ids = torch.tensor(token_ids)

        if token_ids.dim() == 1:
            # Remove special tokens if decode_special_tokens is False
            tokens = [reverse_vocab.get(int(i), '<unk>') for i in token_ids.tolist()]
            if not decode_special_tokens:
                tokens = [t for t in tokens if t not in self.extended_piece_to_id]
            return ' '.join(tokens)
        elif token_ids.dim() == 2:
            # Batch decoding
            results = []
            for row in token_ids:
                tokens = [reverse_vocab.get(int(i), '<unk>') for i in row.tolist()]
                if not decode_special_tokens:
                    tokens = [t for t in tokens if t not in self.extended_piece_to_id]
                results.append(' '.join(tokens))
            return results
        else:
            raise ValueError("Unsupported tensor dimension for decoding.")


# Fixtures for reuse in tests
@pytest.fixture
def tokenizer():
    return OutlinesExLlamaV2Tokenizer(MockTokenizer())

# ------------------- BASIC TEST CASES -------------------

def test_decode_single_token(tokenizer):
    # Test decoding a single token
    codeflash_output = tokenizer.decode(torch.tensor([0])); result = codeflash_output # 17.8μs -> 15.9μs

def test_decode_multiple_tokens(tokenizer):
    # Test decoding multiple tokens
    codeflash_output = tokenizer.decode(torch.tensor([0, 1])); result = codeflash_output # 13.1μs -> 12.0μs

def test_decode_with_special_tokens(tokenizer):
    # Test that special tokens are omitted in output
    codeflash_output = tokenizer.decode(torch.tensor([0, 2, 1, 3])); result = codeflash_output

def test_decode_unknown_token(tokenizer):
    # Test decoding an unknown token id
    codeflash_output = tokenizer.decode(torch.tensor([99])); result = codeflash_output # 11.2μs -> 9.53μs

def test_decode_batch(tokenizer):
    # Test decoding a batch (2D tensor)
    batch = torch.tensor([[0, 1], [4, 5]])
    codeflash_output = tokenizer.decode(batch); result = codeflash_output

# ------------------- EDGE TEST CASES -------------------

def test_decode_empty_tensor(tokenizer):
    # Test decoding an empty tensor
    codeflash_output = tokenizer.decode(torch.tensor([])); result = codeflash_output

def test_decode_all_special_tokens(tokenizer):
    # Test decoding a tensor of only special tokens
    codeflash_output = tokenizer.decode(torch.tensor([2, 3])); result = codeflash_output

def test_decode_tensor_with_mixed_known_and_unknown(tokenizer):
    # Test decoding a tensor with known and unknown tokens
    codeflash_output = tokenizer.decode(torch.tensor([0, 99, 1])); result = codeflash_output

def test_decode_2d_tensor_with_empty_row(tokenizer):
    # Test decoding a 2D tensor with one empty row
    batch = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor([0, 1]), torch.tensor([])],
        batch_first=True, padding_value=3  # pad with special token <pad>
    )
    codeflash_output = tokenizer.decode(batch); result = codeflash_output

def test_decode_tensor_with_only_unknowns(tokenizer):
    # Test decoding a tensor with only unknown tokens
    codeflash_output = tokenizer.decode(torch.tensor([99, 100])); result = codeflash_output

def test_decode_tensor_with_negative_token_ids(tokenizer):
    # Test decoding with negative token ids (should decode to <unk>)
    codeflash_output = tokenizer.decode(torch.tensor([-1, 0, -2])); result = codeflash_output

def test_decode_high_dimensional_tensor_raises(tokenizer):
    # Test that decoding a tensor with more than 2 dimensions raises ValueError
    with pytest.raises(ValueError):
        tokenizer.decode(torch.ones((2,2,2), dtype=torch.long))

# ------------------- LARGE SCALE TEST CASES -------------------

def test_decode_large_1d_tensor(tokenizer):
    # Test decoding a large 1D tensor (length 1000)
    ids = [0, 1, 4, 5] * 250  # 4 * 250 = 1000 tokens
    codeflash_output = tokenizer.decode(torch.tensor(ids)); result = codeflash_output
    expected = 'hello world foo bar ' * 250
    expected = expected.strip()

def test_decode_large_batch(tokenizer):
    # Test decoding a large batch (1000 rows, 4 tokens each)
    batch = torch.tensor([[0, 1, 4, 5]] * 1000)
    codeflash_output = tokenizer.decode(batch); result = codeflash_output

def test_decode_large_batch_with_special_and_unknowns(tokenizer):
    # Test decoding a large batch with special and unknown tokens
    batch = torch.tensor([[0, 2, 99, 1, 3, 100]] * 500)
    codeflash_output = tokenizer.decode(batch); result = codeflash_output

def test_decode_large_sparse_tensor(tokenizer):
    # Test decoding a large sparse tensor (mostly special tokens)
    batch = torch.tensor([[3, 3, 0, 3, 3, 1, 3, 3]] * 200)
    codeflash_output = tokenizer.decode(batch); result = codeflash_output

def test_decode_large_tensor_all_unknown(tokenizer):
    # Test decoding a large tensor of unknown tokens
    ids = [99] * 1000
    codeflash_output = tokenizer.decode(torch.tensor(ids)); result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from typing import List

# imports
import pytest  # used for our unit tests
import torch
from outlines.models.exllamav2 import OutlinesExLlamaV2Tokenizer


# Mock tokenizer for testing purposes
class MockTokenizer:
    def __init__(self):
        # Simulate a vocabulary mapping
        self._piece_to_id = {
            "hello": 0,
            "world": 1,
            "!": 2,
            "<eos>": 3,
            "": 4,
            " ": 5,
            "a": 6,
            "b": 7,
            "c": 8,
        }
        # Special-token pieces (a set here; this mock only checks membership)
        self.extended_piece_to_id = {"<eos>", ""}
        self.eos_token_id = 3

    def get_piece_to_id_dict(self):
        return self._piece_to_id.copy()

    def decode(self, token_ids, decode_special_tokens=False):
        # Accepts torch tensors or lists
        if isinstance(token_ids, torch.Tensor):
            token_ids = token_ids.tolist()
        # Handle empty input
        if not token_ids:
            return ""
        # Simulate decoding: map ids to tokens
        id_to_piece = {v: k for k, v in self._piece_to_id.items()}
        decoded = []
        for tid in token_ids:
            if tid in id_to_piece:
                token = id_to_piece[tid]
                # Optionally skip special tokens
                if not decode_special_tokens and token in self.extended_piece_to_id:
                    continue
                decoded.append(token)
            else:
                decoded.append("<unk>")
        # Always return a single space-joined string (this mock handles 1D input only)
        return " ".join(decoded)

# unit tests

# ---------- BASIC TEST CASES ----------

def test_decode_single_token():
    # Test decoding a single known token
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [0]  # "hello"
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 17.8μs -> 15.9μs

def test_decode_multiple_tokens():
    # Test decoding a sequence of tokens
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [0, 1, 2]  # "hello world !"
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 13.1μs -> 12.0μs

def test_decode_with_spaces():
    # Test decoding tokens including a space token
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [0, 5, 1]  # "hello  world"
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 12.6μs -> 11.0μs

def test_decode_unknown_token():
    # Test decoding an unknown token id
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [42]  # not in vocab
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 11.2μs -> 9.53μs

def test_decode_mixed_known_and_unknown():
    # Test decoding a mix of known and unknown token ids
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [0, 42, 1]  # "hello <unk> world"
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 12.3μs -> 10.7μs

# ---------- EDGE TEST CASES ----------

def test_decode_empty_list():
    # Test decoding an empty list of token ids
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = []
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 8.43μs -> 7.36μs

def test_decode_only_special_tokens():
    # Test decoding a sequence of only special tokens (should be skipped)
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [3, 4]  # "<eos>" and the empty-string piece ""
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 12.5μs -> 11.1μs

def test_decode_special_tokens_and_normal():
    # Test decoding with special tokens and normal tokens intermixed
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [3, 0, 4, 1]  # <eos> hello  world
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 13.0μs -> 10.8μs


def test_decode_tensor_input():
    # Test decoding with a torch tensor input
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = torch.LongTensor([0, 1, 2])
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 37.3μs -> 7.14μs

def test_decode_negative_token_id():
    # Test decoding with a negative token id (should decode as <unk>)
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [-1]
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 14.6μs -> 13.6μs

def test_decode_large_token_id():
    # Test decoding with a token id much larger than vocab (should decode as <unk>)
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [9999]
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 12.4μs -> 11.2μs


def test_decode_nested_list_input():
    # Test decoding with nested list input (should raise error)
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [[0, 1], [2, 3]]
    with pytest.raises(Exception):
        tokenizer.decode(token_ids)

# ---------- LARGE SCALE TEST CASES ----------

def test_decode_long_sequence():
    # Test decoding a long sequence of valid tokens
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [0, 1, 2] * 333  # 999 tokens
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 149μs -> 105μs
    # The expected string is "hello world !" repeated 333 times, joined by spaces
    expected = " ".join(["hello world !"] * 333)

def test_decode_large_with_special_tokens():
    # Test decoding a long sequence with special tokens interspersed
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    # Insert special tokens every 10 tokens
    token_ids = []
    for i in range(1000):
        if i % 10 == 0:
            token_ids.append(3)  # <eos>
        else:
            token_ids.append(0)  # hello
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 150μs -> 103μs
    # Should skip all <eos> tokens, so result is 900 "hello"s
    expected = " ".join(["hello"] * 900)

def test_decode_all_unknown_tokens():
    # Test decoding a long sequence of unknown tokens
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [99] * 1000
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 129μs -> 84.5μs
    expected = " ".join(["<unk>"] * 1000)

def test_decode_large_tensor_input():
    # Test decoding with a large torch tensor input
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = torch.LongTensor([0, 1, 2] * 333 + [0])  # 1000 tokens
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 85.8μs -> 59.8μs
    expected = " ".join(["hello world !"] * 333 + ["hello"])

def test_decode_max_size():
    # Test decoding with the maximum allowed size (1000 tokens)
    tokenizer = OutlinesExLlamaV2Tokenizer(MockTokenizer())
    token_ids = [0] * 1000  # 1000 "hello"
    codeflash_output = tokenizer.decode(token_ids); result = codeflash_output # 147μs -> 100μs
    expected = " ".join(["hello"] * 1000)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
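
The generated tests above contain no assertions of their own; as the closing comment says, correctness is judged by comparing the output of the original and optimized code. A rough sketch of that idea follows. It is not Codeflash's implementation, and `outputs_match`, `decode_original`, `decode_optimized`, and `VOCAB` are hypothetical stand-ins written in plain Python.

```python
from typing import Callable, Iterable, List

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {0: "hello", 1: "world"}


def decode_original(token_ids: List[int]) -> str:
    # Makes a redundant copy of the input before decoding.
    copied = list(token_ids)
    return " ".join(VOCAB.get(tid, "<unk>") for tid in copied)


def decode_optimized(token_ids: List[int]) -> str:
    # Decodes directly from the input without the extra copy.
    return " ".join(VOCAB.get(tid, "<unk>") for tid in token_ids)


def outputs_match(
    original: Callable[[List[int]], str],
    optimized: Callable[[List[int]], str],
    inputs: Iterable[List[int]],
) -> bool:
    """Return True if both callables give equal results on every input."""
    return all(original(x) == optimized(x) for x in inputs)


cases = [[0], [0, 1], [], [99], [0, 99, 1]]
print(outputs_match(decode_original, decode_optimized, cases))
```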

To edit these changes, run `git checkout codeflash/optimize-OutlinesExLlamaV2Tokenizer.decode-mbsi6byo` and push.


codeflash-ai Bot added the "⚡️ codeflash" label (Optimization PR opened by Codeflash AI) on Jun 11, 2025

codeflash-ai Bot requested a review from HeshamHM28 June 11, 2025 22:10

codeflash-ai[bot] wants to merge 1 commit into `main` from `codeflash/optimize-OutlinesExLlamaV2Tokenizer.decode-mbsi6byo`
