
Releases: LessUp/llm-speed

v0.3.0 - Bilingual Documentation & Bug Fixes

21 Apr 19:34


Summary

This release introduces comprehensive bilingual documentation (English/Chinese) and critical bug fixes for CUDA kernels.


Documentation

  • New: Complete bilingual documentation structure (EN/ZH)
  • New: Quick Start guides in both languages
  • New: Troubleshooting guides with common solutions
  • Improved: API reference with detailed parameter descriptions
  • Improved: Architecture documentation with technical deep dive
  • Fixed: Broken documentation links across all files
  • Fixed: Duplicate sections in README.md

CUDA Kernel Bug Fixes

  • Critical: Fixed division by zero in the FlashAttention rescale factor
  • Critical: Fixed the rescale calculation for the first block in Tiled Attention
  • High: Fixed softmax overflow in Naive Attention for the all-masked case
  • High: Added a static assertion to validate BLOCK_SIZE in warp primitives
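These notes describe the fixes only by their symptoms. As a hedged illustration, the host-side C++ sketch below shows one common way to guard the same edge cases in a softmax. It skips the rescale on the first contributing block because there is no earlier state yet. It returns zeros for a fully masked row instead of evaluating exp(-inf - -inf) = NaN, and it checks for l == 0 before the final division. The names safe_softmax, OnlineRow, online_update and online_finalize are hypothetical, not taken from llm-speed. The project's real kernels are CUDA and may place these guards differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

constexpr float kNegInf = -std::numeric_limits<float>::infinity();

// Naive softmax over one row; masked positions hold -inf.
// If every position is masked, return zeros instead of NaN.
std::vector<float> safe_softmax(const std::vector<float>& s) {
    std::vector<float> p(s.size(), 0.0f);
    float m = kNegInf;
    for (float x : s) m = std::max(m, x);
    if (m == kNegInf) return p;            // all-masked row
    float sum = 0.0f;
    for (std::size_t i = 0; i < s.size(); ++i) {
        p[i] = std::exp(s[i] - m);         // exp(-inf) == 0 for masked entries
        sum += p[i];
    }
    for (float& x : p) x /= sum;           // sum >= 1: the max term is exp(0)
    return p;
}

// Running state of an online (tiled / FlashAttention-style) softmax for one query row.
struct OnlineRow {
    float m = kNegInf;       // running max of the scores seen so far
    float l = 0.0f;          // running sum of exp(score - m)
    std::vector<float> acc;  // running unnormalized sum of p * V
    explicit OnlineRow(std::size_t d) : acc(d, 0.0f) {}
};

// Fold one block of scores s[j] and value rows v[j] into the running state.
void online_update(OnlineRow& r, const std::vector<float>& s,
                   const std::vector<std::vector<float>>& v) {
    float m_blk = kNegInf;
    for (float x : s) m_blk = std::max(m_blk, x);
    if (m_blk == kNegInf) return;          // fully masked block: nothing to add
    float m_new = std::max(r.m, m_blk);
    // First contributing block: there is no previous state to rescale.
    float alpha = (r.m == kNegInf) ? 0.0f : std::exp(r.m - m_new);
    r.l *= alpha;
    for (float& a : r.acc) a *= alpha;
    for (std::size_t j = 0; j < s.size(); ++j) {
        float p = std::exp(s[j] - m_new);
        r.l += p;
        for (std::size_t d = 0; d < r.acc.size(); ++d) r.acc[d] += p * v[j][d];
    }
    r.m = m_new;
}

// Final normalization; l == 0 means every key was masked, so avoid 0 / 0.
std::vector<float> online_finalize(const OnlineRow& r) {
    std::vector<float> out(r.acc.size(), 0.0f);
    if (r.l == 0.0f) return out;
    for (std::size_t d = 0; d < out.size(); ++d) out[d] = r.acc[d] / r.l;
    return out;
}
```

Feeding the scores {0, ln 3} against value rows {1, 2} yields weights 1/4 and 3/4, and therefore an output of 1.75. This holds whether the scores arrive in one block or two, and it is the invariant the rescale factor alpha is there to preserve.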

Performance

Kernel              Memory    Best for
FlashAttention      O(N)      Long sequences (2K+ tokens)
Tiled Attention     O(N²)     Medium-length sequences
Tensor Core GEMM    -         Reaches 95%+ of cuBLAS performance
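The sketch below puts numbers on the O(N) versus O(N²) memory column. It assumes fp32 values, one batch element and one head. The O(N²) side is modeled as a fully materialized N x N score matrix, and the O(N) side keeps only a running max and a running sum per query row. Q, K, V and O are O(N·d) in both cases, so they are left out. The function names are illustrative. These notes do not say exactly what llm-speed's Tiled Attention stores.

```cpp
#include <cstdint>

// Assumptions (illustrative only): one batch element, one head, fp32 (4-byte) values.
constexpr std::uint64_t kBytesPerFloat = 4;

// Extra memory if the full N x N attention-score matrix is materialized: O(N^2).
constexpr std::uint64_t score_matrix_bytes(std::uint64_t n) {
    return n * n * kBytesPerFloat;
}

// Extra memory for streaming softmax statistics (running max and running sum
// per query row), as in FlashAttention-style kernels: O(N).
constexpr std::uint64_t row_stats_bytes(std::uint64_t n) {
    return n * 2 * kBytesPerFloat;
}
```

At N = 2048 the score matrix takes 16,777,216 bytes (16 MiB), compared with 16,384 bytes (16 KiB) of row statistics. Going to N = 8192 multiplies the matrix by 16 (256 MiB) but the statistics by only 4 (64 KiB). That gap is why the O(N) kernel is the better fit for long sequences.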

What's Changed

  • Documentation structure optimization and bug fixes @shane

Full Changelog: v0.2.0...v0.3.0


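Update Summary

This release introduces complete bilingual documentation (Chinese and English) and critical bug fixes for CUDA kernels.


Documentation

  • New: Complete bilingual documentation structure (Chinese and English)
  • New: Quick Start guides in both languages
  • New: Troubleshooting guides with solutions to common problems
  • Improved: Detailed API reference documentation
  • Improved: Architecture documentation with a technical deep dive
  • Fixed: Documentation links across all files
  • Fixed: Duplicate sections in README.md

CUDA Kernel Bug Fixes

  • Critical: Fixed division by zero in the FlashAttention rescale factor
  • Critical: Fixed the rescale calculation for the first block in Tiled Attention
  • High: Fixed softmax overflow in Naive Attention for the all-masked case
  • High: Added a static assertion to validate BLOCK_SIZE in warp primitives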
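A compile-time BLOCK_SIZE check turns a silent miscount into a build error. The host-side C++ sketch below illustrates the idea. The constraints it asserts are assumptions, because the notes do not list what llm-speed actually checks: BLOCK_SIZE must be positive, a multiple of the 32-thread warp size, and at most 1024 threads, the CUDA per-block limit. warps_per_block and block_sum are hypothetical names, and the loop over lanes stands in for the warp shuffle a GPU kernel would use.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Assumed constraints, for illustration only.
template <std::size_t BLOCK_SIZE>
constexpr std::size_t warps_per_block() {
    static_assert(BLOCK_SIZE > 0, "BLOCK_SIZE must be positive");
    static_assert(BLOCK_SIZE % kWarpSize == 0,
                  "BLOCK_SIZE must be a multiple of the warp size");
    static_assert(BLOCK_SIZE <= 1024,
                  "BLOCK_SIZE exceeds the 1024-threads-per-block limit");
    return BLOCK_SIZE / kWarpSize;
}

// Host-side model of a two-level block reduction: each warp sums its 32 lanes,
// then the per-warp partials are summed. Without the multiple-of-32 check,
// BLOCK_SIZE / kWarpSize would truncate and a trailing partial warp's lanes
// would be silently dropped.
template <std::size_t BLOCK_SIZE>
float block_sum(const std::array<float, BLOCK_SIZE>& x) {
    constexpr std::size_t kWarps = warps_per_block<BLOCK_SIZE>();
    std::array<float, kWarps> partial{};  // one slot per warp, like a shared-memory buffer
    for (std::size_t w = 0; w < kWarps; ++w)
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            partial[w] += x[w * kWarpSize + lane];
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With this check in place, instantiating warps_per_block<48>() fails to compile with the multiple-of-warp-size message. Without it, the code would quietly reduce only the first 32 values.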

What's Changed

  • Documentation structure optimization and bug fixes @shane

Full Changelog: v0.2.0...v0.3.0

v0.2.0 - Documentation & CI Enhancement

16 Apr 02:22


Summary

This release focuses on comprehensive documentation restructuring, CI/CD improvements, and code quality enhancements.


Documentation

  • New: API Reference (docs/api.md) - Complete API documentation with examples
  • New: Performance Guide (docs/performance.md) - Hardware requirements, benchmarking, and optimization tips
  • Improved: Technical Deep Dive (docs/deepwiki.md) - Restructured with architecture diagrams and optimization roadmap
  • Improved: CONTRIBUTING.md - Added quick reference tables and detailed workflow
  • Improved: CLAUDE.md - Added architecture overview and common tasks section

GitHub Pages

  • New: Custom Jekyll layout (_layouts/default.html) with responsive design
  • Improved: Navigation bar with links to all documentation sections
  • Improved: SEO meta tags and social media integration
  • Improved: Documentation homepage (index.md) with quick start guide

CHANGELOG

  • Restructured: Adopted the Keep a Changelog format
  • Added: Version tracking with comparison links
  • Added: Migration guide for users
  • Removed: Scattered changelog files, consolidated into a single CHANGELOG.md

CI/CD

  • Improved: Split the CI workflow into separate lint, test, and docs jobs
  • Improved: Added YAML validation step
  • Improved: Better error handling in test execution
  • Improved: Path-based filtering for Pages deployment

Code Quality

  • Fixed: Python code formatting (ruff format) across all files
  • Fixed: Divide-by-zero protection in CUDA kernels (naive_attention.cu, flash_attention.cu)
  • Fixed: Integer overflow in GEMM index calculations (changed to int64_t)
  • Added: Empty tensor validation in Python bindings
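The GEMM fix widens index arithmetic to int64_t. The C++ sketch below shows why this matters. A row-major offset row * ld + col computed in 32-bit signed arithmetic overflows once a matrix has more than 2^31 - 1 elements, and signed overflow is undefined behaviour. Casting one operand to int64_t before multiplying avoids this. offset64 and offsets_fit_int32 are hypothetical helper names. In a CUDA kernel, row and col would come from blockIdx/threadIdx.

```cpp
#include <cstdint>
#include <limits>

// Row-major offset of element (row, col) with leading dimension ld.
// Widening to int64_t before the multiply keeps row * ld from overflowing 32 bits.
constexpr std::int64_t offset64(std::int32_t row, std::int32_t col, std::int32_t ld) {
    return static_cast<std::int64_t>(row) * ld + col;
}

// True if every offset of an m x n row-major matrix fits in a signed 32-bit int.
constexpr bool offsets_fit_int32(std::int64_t m, std::int64_t n) {
    return m * n - 1 <= std::numeric_limits<std::int32_t>::max();
}
```

For a 50,000 x 50,000 matrix the last offset is 2,499,999,999, which is above INT32_MAX (2,147,483,647). The largest square matrix whose offsets still fit in 32 bits is 46,340 x 46,340.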

Specifications

  • Improved: Requirements document with specifications REQ-1 through REQ-8
  • Improved: Tasks document with phase grouping and a dependency graph
  • Improved: Design document with kernel specifications and shared memory layouts

What's Changed

  • Comprehensive documentation restructure and optimization @shane

Full Changelog: v0.1.0...v0.2.0


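Update Summary

This release focuses mainly on documentation restructuring, CI/CD improvements, and code quality enhancements.


Documentation

  • New: API reference (docs/api.md) - complete API documentation with examples
  • New: Performance tuning guide (docs/performance.md) - hardware requirements, benchmarking, and optimization tips
  • Improved: Technical deep dive (docs/deepwiki.md) - added architecture diagrams and an optimization roadmap
  • Improved: Contributing guide - added quick reference tables and a detailed workflow
  • Improved: CLAUDE.md - added an architecture overview and descriptions of common tasks

GitHub Pages

  • New: Custom Jekyll layout (_layouts/default.html) with responsive design
  • Improved: Navigation bar linking to all documentation sections
  • Improved: SEO meta tags and social media integration
  • Improved: Documentation homepage (index.md) with a quick start guide

CHANGELOG

  • Restructured: Adopted the Keep a Changelog format
  • Added: Version tracking and comparison links
  • Added: Migration guide for users
  • Removed: Scattered changelog files, merged into a single CHANGELOG.md

CI/CD

  • Improved: Split the CI workflow into three jobs: lint, test, and docs
  • Improved: Added a YAML validation step
  • Improved: Better error handling during test execution
  • Improved: Path filtering for Pages deployment

Code Quality

  • Fixed: Python code formatting (ruff format)
  • Fixed: Divide-by-zero protection in CUDA kernels
  • Fixed: Integer overflow in GEMM index calculations (changed to int64_t)
  • Added: Empty tensor validation in Python bindings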
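Rejecting empty tensors in the bindings stops a zero-element input before it reaches a kernel launch. Such an input would otherwise produce a grid size of zero, which CUDA rejects as an invalid launch configuration. The C++ sketch below models the check on a plain shape vector so that it runs without PyTorch. Shape, numel and check_nonempty are hypothetical stand-ins. A real PyTorch C++ extension would more likely write something like TORCH_CHECK(q.numel() > 0, "q must not be empty"), which surfaces in Python as an exception.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical shape type; real bindings would inspect a torch::Tensor instead.
using Shape = std::vector<std::int64_t>;

// Number of elements: the product of the dimensions (1 for a 0-d shape).
std::int64_t numel(const Shape& shape) {
    std::int64_t n = 1;
    for (std::int64_t d : shape) n *= d;
    return n;
}

// Reject empty inputs before any kernel launch, naming the offending argument.
void check_nonempty(const Shape& shape, const std::string& name) {
    if (numel(shape) == 0) {
        throw std::invalid_argument(name + " must not be empty");
    }
}
```

Any zero-sized dimension triggers the error. For example, a query shaped {2, 0, 64} (batch 2, sequence length 0, head dim 64) is rejected with "q must not be empty".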

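Specifications

  • Improved: Requirements document with specifications REQ-1 through REQ-8
  • Improved: Tasks document with phase grouping and a dependency graph
  • Improved: Design document with kernel specifications and shared memory layouts

What's Changed

  • Documentation restructure and optimization @shane

Full Changelog: v0.1.0...v0.2.0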