Releases: LessUp/llm-speed
v0.3.0 - Bilingual Documentation & Bug Fixes
Summary
This release adds bilingual documentation (English and Chinese) and fixes several critical bugs in the CUDA kernels.
Documentation
- New: Complete bilingual documentation structure (EN/ZH)
- New: Quick Start guides in both languages
- New: Troubleshooting guides with common solutions
- Improved: API reference with detailed parameter descriptions
- Improved: Architecture documentation with technical deep dive
- Fixed: Broken documentation links across all files
- Fixed: Duplicate sections in README.md
CUDA Kernel Bug Fixes
- Critical: Fixed division by zero in FlashAttention rescale factor
- Critical: Fixed rescale calculation in Tiled Attention for first block
- High: Fixed softmax overflow in Naive Attention for all-masked case
- High: Added static assertion for BLOCK_SIZE validation in warp primitives
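The rescale and all-masked fixes listed above concern the streaming (online) softmax that block-wise attention relies on. The following is a minimal pure-Python sketch of that idea for a single query row, not the project's CUDA code. The function name `online_softmax_weighted_sum`, the scalar values, and the choice to return 0.0 for a fully masked row are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum(softmax(scores) * values) one block at a time.

    scores: list of floats; -inf marks a masked position.
    values: list of floats of the same length (one scalar per position).
    """
    running_max = NEG_INF  # largest score seen so far
    running_sum = 0.0      # sum of exp(score - running_max) seen so far
    acc = 0.0              # unnormalized output, relative to running_max

    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        new_max = max(running_max, max(block_s))

        if new_max == NEG_INF:
            # Everything seen so far is masked. exp(-inf - (-inf)) would be NaN,
            # so skip the block instead of updating the accumulators.
            continue

        # Rescale factor for the old accumulators. On the first unmasked block
        # there is nothing to rescale, so the factor is set to 0 explicitly.
        scale = 0.0 if running_max == NEG_INF else math.exp(running_max - new_max)

        weights = [math.exp(s - new_max) for s in block_s]
        running_sum = running_sum * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, block_v))
        running_max = new_max

    # Guard the final normalization: a fully masked row has running_sum == 0.
    # Returning 0.0 here is an assumed convention, not necessarily the kernel's.
    return acc / running_sum if running_sum > 0.0 else 0.0

print(online_softmax_weighted_sum([NEG_INF, NEG_INF], [1.0, 2.0]))  # 0.0
```

Without the two guards, the first block and the all-masked row are the cases where a NaN or a division by zero would appear, which matches the failure modes named in the list.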
Performance
| Kernel | Memory | Best For |
|---|---|---|
| FlashAttention | O(N) | Long sequences (2K+) |
| Tiled Attention | O(N²) | Medium sequences |
| Tensor Core GEMM | - | Reaching 95%+ of cuBLAS performance |
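To put the Memory column in concrete terms, here is a rough back-of-the-envelope sketch. It assumes the O(N²) entry corresponds to storing a full N x N score matrix per head, the O(N) entry to keeping only a running max and sum per row, and 4-byte (fp32) elements. The helper `attention_scratch_bytes` is hypothetical and ignores Q, K, V, and the output.

```python
def attention_scratch_bytes(seq_len, bytes_per_element=4):
    """Return (full score matrix bytes, per-row statistics bytes) for one head."""
    full_scores = seq_len * seq_len * bytes_per_element  # O(N^2): N x N scores
    row_stats = 2 * seq_len * bytes_per_element          # O(N): max and sum per row
    return full_scores, row_stats

for n in (512, 2048, 8192):
    full, stats = attention_scratch_bytes(n)
    print(f"N={n}: score matrix {full / 2**20:.0f} MiB, "
          f"row statistics {stats / 2**10:.0f} KiB")
```

Under these assumptions the quadratic term grows 16x for every 4x increase in sequence length, which is why the linear-memory kernel is the one recommended for sequences of 2K and beyond.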
What's Changed
- Documentation structure optimization and bug fixes @shane
Full Changelog: v0.2.0...v0.3.0
v0.2.0 - Documentation & CI Enhancement
Summary
This release focuses on comprehensive documentation restructuring, CI/CD improvements, and code quality enhancements.
Documentation
- New: API Reference (docs/api.md) - Complete API documentation with examples
- New: Performance Guide (docs/performance.md) - Hardware requirements, benchmarking, and optimization tips
- Improved: Technical Deep Dive (docs/deepwiki.md) - Restructured with architecture diagrams and optimization roadmap
- Improved: CONTRIBUTING.md - Added quick reference tables and detailed workflow
- Improved: CLAUDE.md - Added architecture overview and common tasks section
GitHub Pages
- New: Custom Jekyll layout (_layouts/default.html) with responsive design
- Improved: Navigation bar with links to all documentation sections
- Improved: SEO meta tags and social media integration
- Improved: Documentation homepage (index.md) with quick start guide
CHANGELOG
- Restructured: Adopted Keep a Changelog format
- Added: Version tracking with comparison links
- Added: Migration guide for users
- Removed: Scattered changelog files, consolidated into a single CHANGELOG.md
CI/CD
- Improved: Separated lint, test, and docs jobs in CI workflow
- Improved: Added YAML validation step
- Improved: Better error handling in test execution
- Improved: Path-based filtering for Pages deployment
Code Quality
- Fixed: Python code formatting (ruff format) across all files
- Fixed: Divide-by-zero protection in CUDA kernels (naive_attention.cu, flash_attention.cu)
- Fixed: Integer overflow in GEMM index calculations (changed to int64_t)
- Added: Empty tensor validation in Python bindings
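The integer overflow fix can be illustrated with a small arithmetic sketch. It simulates what a flat row-major index does when held in a 32-bit signed integer. The helpers `wrap_int32` and `flat_index` and the matrix size are hypothetical, and the wraparound shown is only the commonly observed result, since signed overflow is undefined behavior in C++.

```python
def wrap_int32(value):
    """Reinterpret an integer as a two's-complement 32-bit signed value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def flat_index(row, col, leading_dim):
    """Row-major offset of element (row, col)."""
    return row * leading_dim + col

rows, cols = 65536, 65536
exact = flat_index(rows - 1, cols - 1, cols)  # Python ints do not overflow
wrapped = wrap_int32(exact)                   # what a 32-bit int would hold

print(exact)    # 4294967295
print(wrapped)  # -1
```

A 64-bit index holds the exact value, whereas the wrapped 32-bit value would point outside the buffer, which is the kind of failure a switch to int64_t avoids.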
Specifications
- Improved: Requirements document with REQ-1 to REQ-8 specifications
- Improved: Tasks document with Phase grouping and dependency graph
- Improved: Design document with kernel specifications and shared memory layouts
What's Changed
- Comprehensive documentation restructure and optimization @shane
Full Changelog: v0.1.0...v0.2.0