211 changes: 211 additions & 0 deletions CLI.md
@@ -0,0 +1,211 @@
# tiktoken CLI

Command-line interface for counting tokens in files and directories.

## Installation

After installing tiktoken, the `tiktoken` command will be available:

```bash
pip install tiktoken
```

## Usage

### Basic Token Counting

Count tokens in a single file:
```bash
tiktoken count file.txt
```

Output:
```
42
```

### Using Specific Models

Count tokens using a specific model's encoding:
```bash
tiktoken count --model gpt-4o document.txt
tiktoken count --model gpt-4-turbo code.py
```

### Directory Operations

Count tokens in all files in a directory:
```bash
tiktoken count --recursive ./src/
```

Use glob patterns to filter files:
```bash
tiktoken count --glob "*.py" ./project/
tiktoken count --recursive --glob "*.md" ./docs/
```
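
How `--glob` and `--recursive` combine is not spelled out above. One plausible reading is that the pattern is matched only at the top level of the directory unless `--recursive` is given. The sketch below illustrates that reading with `pathlib`; `find_files` is a hypothetical helper written for this example, not the CLI's actual implementation.

```python
from pathlib import Path
import tempfile


def find_files(root: Path, pattern: str = "*", recursive: bool = False) -> list[Path]:
    """Return regular files under root matching pattern (hypothetical helper)."""
    # rglob descends into subdirectories; glob only looks at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "main.py").write_text("print('hi')\n")
    (base / "notes.md").write_text("# notes\n")
    (base / "pkg").mkdir()
    (base / "pkg" / "util.py").write_text("x = 1\n")

    top_level = [p.name for p in find_files(base, "*.py")]
    everything = [p.name for p in find_files(base, "*.py", recursive=True)]
```

Under this assumption, `top_level` contains only `main.py`, while `everything` also picks up `pkg/util.py`.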

### Output Formats

#### JSON Output
```bash
tiktoken count --json file.txt
```

Output:
```json
{
  "summary": {
    "total_files": 1,
    "total_tokens": 1250,
    "total_characters": 5432,
    "average_tokens_per_file": 1250
  },
  "files": [
    {
      "file": "file.txt",
      "tokens": 1250,
      "chars": 5432,
      "lines": 85
    }
  ]
}
```
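
The JSON report is straightforward to consume from a script. A minimal sketch, assuming the report has exactly the shape shown above (the sample is pasted inline here so the snippet is self-contained):

```python
import json

# Sample report copied from the output above.
report_text = """
{
  "summary": {
    "total_files": 1,
    "total_tokens": 1250,
    "total_characters": 5432,
    "average_tokens_per_file": 1250
  },
  "files": [
    {"file": "file.txt", "tokens": 1250, "chars": 5432, "lines": 85}
  ]
}
"""

report = json.loads(report_text)

# Overall total, taken from the summary section.
total_tokens = report["summary"]["total_tokens"]

# Per-file token counts keyed by path.
tokens_by_file = {entry["file"]: entry["tokens"] for entry in report["files"]}
```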

#### CSV Output
```bash
tiktoken count --csv ./src/
```

Output:
```csv
file,tokens,characters,lines
src/main.py,450,2100,65
src/utils.py,320,1540,48
src/config.py,180,850,28
```
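
The CSV form suits quick analysis with the standard library. The sketch below assumes the header row shown above and uses the sample rows as inline data; it finds the file with the most tokens and the overall total.

```python
import csv
import io

# Sample CSV copied from the output above.
csv_text = """file,tokens,characters,lines
src/main.py,450,2100,65
src/utils.py,320,1540,48
src/config.py,180,850,28
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# DictReader yields strings, so convert counts before comparing or summing.
largest = max(rows, key=lambda row: int(row["tokens"]))
csv_total_tokens = sum(int(row["tokens"]) for row in rows)
```

Here `largest["file"]` is `src/main.py` and `csv_total_tokens` is 950.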

#### Per-File Breakdown
```bash
tiktoken count --per-file ./src/
```

Output:
```
src/main.py: 450 tokens
src/utils.py: 320 tokens
src/config.py: 180 tokens

Total files: 3
Total tokens: 950
Total characters: 4490
Average tokens per file: 316
```
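
The summary lines follow directly from the per-file counts. The arithmetic below reproduces them; the average shown above (316) matches 950 / 3 = 316.67 rounded down, so integer division is assumed here rather than confirmed from the implementation.

```python
# Per-file token counts from the sample output above.
per_file = {"src/main.py": 450, "src/utils.py": 320, "src/config.py": 180}

summary_files = len(per_file)
summary_tokens = sum(per_file.values())

# Assumption: the CLI truncates the average rather than rounding it.
summary_average = summary_tokens // summary_files
```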

## Use Cases

### Estimating Context Window Usage

Check whether your codebase fits in a model's context window:

```bash
# GPT-4 Turbo has 128k token context
tiktoken count --model gpt-4-turbo --recursive ./my-project/

# Output: Total tokens: 45,230
# Result: Fits comfortably in context window
```
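
A total that is below the context limit may still leave too little room for the model's reply. The sketch below makes that check explicit; `fits_in_context` and the `reserved_for_output` default are illustrative choices for this example, not part of the CLI.

```python
def fits_in_context(total_tokens: int, context_window: int,
                    reserved_for_output: int = 4096) -> bool:
    """Return True if the input plus a reserved reply budget fits the window."""
    return total_tokens + reserved_for_output <= context_window


# Figures from the example above: 45,230 tokens against a 128k window.
project_tokens = 45_230
window = 128_000

fits = fits_in_context(project_tokens, window)

# Tokens left over after the input and the reserved reply budget.
headroom = window - project_tokens - 4096
```

With these numbers `fits` is `True` and `headroom` is 78,674 tokens.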

### Cost Estimation

Estimate API costs by counting tokens:

```bash
tiktoken count --json --recursive ./documents/ > token_report.json
# Use the token count to calculate costs based on model pricing
```
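
Turning the count into a cost is a single multiplication. A minimal sketch, assuming the report shape from the JSON Output section; `estimate_cost` is a hypothetical helper and the price used is a placeholder, not a real rate, so substitute the current figure from your provider's pricing page.

```python
import json


def estimate_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate input cost in USD for a given token count."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# In practice this would come from: json.load(open("token_report.json"))
cost_report = json.loads('{"summary": {"total_tokens": 1250}}')

# Placeholder price for illustration only.
PLACEHOLDER_USD_PER_MILLION = 2.50

cost = estimate_cost(cost_report["summary"]["total_tokens"],
                     PLACEHOLDER_USD_PER_MILLION)
print(f"Estimated input cost: ${cost:.6f}")
```

This covers input tokens only; output tokens are typically billed separately.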

### CI/CD Integration

Add token counting to your CI pipeline:

```bash
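#!/bin/bash
# Fail the script if any stage of the pipeline fails (e.g. grep finds no total).
set -euo pipefail

TOKEN_COUNT=$(tiktoken count --recursive ./src/ | grep "Total tokens" | awk '{print $3}' | tr -d ',')
MAX_TOKENS=50000

# Quote the variables so an empty count is not silently treated as a pass.
if [ "$TOKEN_COUNT" -gt "$MAX_TOKENS" ]; then
  echo "Error: Codebase exceeds $MAX_TOKENS tokens (found: $TOKEN_COUNT)"
  exit 1
fi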
```
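
The `grep | awk | tr` pipeline above extracts the number from the `Total tokens:` line and strips thousands separators. If the CI job is written in Python, the same extraction can be sketched as follows; `parse_total_tokens` is a hypothetical helper, and the text format is assumed to match the sample output in this document.

```python
import re


def parse_total_tokens(output: str) -> int:
    """Extract the integer from a 'Total tokens: N' line, ignoring commas."""
    match = re.search(r"^Total tokens:\s*([\d,]+)", output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Total tokens' line found in output")
    return int(match.group(1).replace(",", ""))


# Sample text taken from the per-file breakdown shown earlier.
sample_output = """Total files: 3
Total tokens: 950
Total characters: 4490
Average tokens per file: 316
"""

parsed_total = parse_total_tokens(sample_output)
parsed_with_commas = parse_total_tokens("Total tokens: 45,230")
```

Raising on a missing line makes the job fail loudly instead of comparing against an empty value.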

### Documentation Analysis

Analyze documentation token usage:

```bash
tiktoken count --recursive --glob "*.md" --per-file ./docs/ | tee docs_tokens.txt
```

## Command Reference

### Arguments

- `paths`: One or more files or directories to process

### Options

- `-m, --model MODEL`: Use encoding for specific OpenAI model (e.g., `gpt-4o`, `gpt-4-turbo`)
- `-e, --encoding ENCODING`: Specify encoding directly (default: `o200k_base`)
- `-r, --recursive`: Process directories recursively
- `-g, --glob PATTERN`: Filter files using glob pattern (e.g., `"*.py"`)
- `--json`: Output results as JSON
- `--csv`: Output results as CSV
- `--summary`: Show summary statistics
- `--per-file`: Show per-file token counts
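
The options above map naturally onto an `argparse` parser. The sketch below is one way such a parser could be declared. It is built only from the option list in this section and is not the code in `tiktoken/cli.py`; details such as whether `--json` and `--csv` are mutually exclusive are left open.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="tiktoken")
    subcommands = parser.add_subparsers(dest="command", required=True)

    count = subcommands.add_parser("count", help="Count tokens in files and directories")
    count.add_argument("paths", nargs="+", help="Files or directories to process")
    count.add_argument("-m", "--model", help="Use the encoding for a specific model")
    count.add_argument("-e", "--encoding", default="o200k_base", help="Encoding name")
    count.add_argument("-r", "--recursive", action="store_true")
    count.add_argument("-g", "--glob", metavar="PATTERN")
    count.add_argument("--json", action="store_true")
    count.add_argument("--csv", action="store_true")
    count.add_argument("--summary", action="store_true")
    count.add_argument("--per-file", action="store_true")
    return parser


args = build_parser().parse_args(
    ["count", "--model", "gpt-4o", "-r", "--glob", "*.py", "./project/"]
)
```

After parsing, `args.model` is `"gpt-4o"`, `args.recursive` is `True`, and `args.encoding` keeps the documented default `o200k_base`.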

## Examples

### Count tokens in Python files
```bash
tiktoken count --glob "*.py" --recursive ./project/
```

### Generate JSON report for multiple files
```bash
tiktoken count --json file1.txt file2.txt file3.txt > report.json
```

### Check specific model compatibility
```bash
tiktoken count --model gpt-4o --summary ./codebase/
```

### Export to CSV for analysis
```bash
tiktoken count --csv --recursive ./src/ > tokens.csv
```

## Tips

1. **Performance**: The CLI processes files quickly thanks to tiktoken's fast Rust implementation
2. **Binary Files**: Binary files are automatically skipped
3. **Large Directories**: Use `--glob` to filter files and speed up processing
4. **Shell Integration**: Pipe output to other tools for further processing
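
How binary files are detected is not described here. A common heuristic, shown below as an assumption rather than the CLI's actual method, is to read the first few kilobytes and treat the file as binary if it contains a NUL byte. `looks_binary` is a hypothetical name.

```python
from pathlib import Path
import tempfile


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic: a NUL byte in the first sample_size bytes suggests binary data."""
    with path.open("rb") as handle:
        chunk = handle.read(sample_size)
    return b"\x00" in chunk


# Demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "notes.txt"
    text_file.write_text("plain text\n")

    blob_file = Path(tmp) / "image.bin"
    blob_file.write_bytes(b"\x89PNG\x00\x01\x02")

    text_is_binary = looks_binary(text_file)
    blob_is_binary = looks_binary(blob_file)
```

The heuristic is cheap but imperfect: UTF-16 text contains NUL bytes and would be skipped under this rule.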

## Troubleshooting

**Error: "No files found to process"**
- Check your glob pattern syntax
- Ensure files exist in the specified path
- Use `--recursive` for subdirectories

**Error: "Unknown model 'xyz'"**
- The model name might be incorrect
- Use `--encoding` instead to specify encoding directly
- Check [OpenAI's model documentation](https://platform.openai.com/docs/models) for valid model names

**Binary file warnings**
- The CLI automatically skips binary files
- This is expected behavior and can be ignored
5 changes: 5 additions & 0 deletions setup.py
@@ -15,5 +15,10 @@
     ],
     package_data={"tiktoken": ["py.typed"]},
     packages=["tiktoken", "tiktoken_ext"],
     entry_points={
         "console_scripts": [
             "tiktoken=tiktoken.cli:main",
         ],
     },
     zip_safe=False,
 )
119 changes: 119 additions & 0 deletions tests/test_cli.py
@@ -0,0 +1,119 @@
"""
Test suite for tiktoken CLI.

Run with: pytest tests/test_cli.py
"""

import os
import sys
import tempfile
from pathlib import Path

# Add parent directory to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent))

from tiktoken.cli import (
    count_tokens_in_text,
    count_tokens_in_file,
    collect_files,
    format_output_json,
    format_output_csv,
)


def test_count_tokens_in_text():
    """Test basic token counting."""
    text = "Hello, world!"
    count = count_tokens_in_text(text, "o200k_base")
    assert count > 0
    assert isinstance(count, int)


def test_count_tokens_in_file():
    """Test counting tokens in a file."""
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        f.write("This is a test file for tiktoken CLI.")
        temp_path = f.name

    try:
        result = count_tokens_in_file(Path(temp_path), "o200k_base")
        assert result is not None
        assert 'tokens' in result
        assert 'chars' in result
        assert 'lines' in result
        assert result['tokens'] > 0
    finally:
        os.unlink(temp_path)


def test_collect_files_single_file():
    """Test collecting a single file."""
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        temp_path = f.name

    try:
        files = collect_files([temp_path], False, None)
        assert len(files) == 1
        assert files[0] == Path(temp_path)
    finally:
        os.unlink(temp_path)


def test_collect_files_directory():
    """Test collecting files from a directory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create test files
        test_dir = Path(tmpdir)
        (test_dir / "file1.txt").write_text("content 1")
        (test_dir / "file2.txt").write_text("content 2")

        files = collect_files([tmpdir], False, None)
        assert len(files) == 2


def test_format_output_json():
    """Test JSON output formatting."""
    results = [
        {'file': 'test.txt', 'tokens': 100, 'chars': 500, 'lines': 10}
    ]

    output = format_output_json(results)
    assert 'summary' in output
    assert 'total_tokens' in output
    assert '100' in output


def test_format_output_csv():
    """Test CSV output formatting."""
    results = [
        {'file': 'test.txt', 'tokens': 100, 'chars': 500, 'lines': 10}
    ]

    output = format_output_csv(results)
    assert 'file,tokens,characters,lines' in output
    assert 'test.txt,100,500,10' in output


if __name__ == '__main__':
    # Run basic tests
    print("Running tiktoken CLI tests...")

    test_count_tokens_in_text()
    print("✓ test_count_tokens_in_text")

    test_count_tokens_in_file()
    print("✓ test_count_tokens_in_file")

    test_collect_files_single_file()
    print("✓ test_collect_files_single_file")

    test_collect_files_directory()
    print("✓ test_collect_files_directory")

    test_format_output_json()
    print("✓ test_format_output_json")

    test_format_output_csv()
    print("✓ test_format_output_csv")

    print("\n✅ All tests passed!")