
Various improvements to the docs #3030

Merged
maleadt merged 13 commits into JuliaGPU:master from giordano:mg/docs
Apr 9, 2026

Conversation

@giordano
Contributor

I had some...uhm...fun in the last couple of days trying to port some C++ CUDA code to CUDA.jl and profile it. I dumped my experience into this PR, hoping to make the lives of people after me a little bit easier 🙂
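The kind of port-and-profile workflow described above can be sketched as follows. This is a minimal sketch, not code from the PR: it assumes CUDA.jl is installed, a CUDA-capable GPU is available, and it uses CUDA.jl's integrated profiler via `CUDA.@profile`. The kernel name `scale!` is made up for illustration.

```julia
using CUDA

# Hypothetical port of a small C++ CUDA kernel: scale a vector in place.
# Note that threadIdx()/blockIdx() are 1-based in CUDA.jl, unlike C/C++.
function scale!(a, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] *= s
    end
    return
end

a = CUDA.rand(Float32, 2^20)
threads = 256
blocks = cld(length(a), threads)
@cuda threads=threads blocks=blocks scale!(a, 2f0)

# Collect a trace of host and device activity for the same launch.
CUDA.@profile @cuda threads=threads blocks=blocks scale!(a, 2f0)
```

Running under `CUDA.@profile` prints a summary of host-side API calls and device-side kernel executions, which is usually enough to spot obvious launch-overhead or memory-transfer problems before reaching for NSight.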

@github-actions
Contributor

github-actions Bot commented Feb 13, 2026

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/device/intrinsics/indexing.jl b/src/device/intrinsics/indexing.jl
index 5e6209fe3..dd9655911 100644
--- a/src/device/intrinsics/indexing.jl
+++ b/src/device/intrinsics/indexing.jl
@@ -92,62 +92,62 @@ end
 @doc """
     threadIdx()::NamedTuple
 
-Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
+    Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
 """ threadIdx
 @inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
 
 @doc """
     blockDim()::NamedTuple
 
-Returns the dimensions (in threads) of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-Unlike the `*Idx` intrinsics, `blockDim` returns the same value as its C/C++ extension counterpart.
+    Returns the dimensions (in threads) of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+    Unlike the `*Idx` intrinsics, `blockDim` returns the same value as its C/C++ extension counterpart.
 """ blockDim
 @inline blockDim() = (x=blockDim_x(), y=blockDim_y(), z=blockDim_z())
 
 @doc """
     blockIdx()::NamedTuple
 
-Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
+    Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
 """ blockIdx
 @inline blockIdx() = (x=blockIdx_x(), y=blockIdx_y(), z=blockIdx_z())
 
 @doc """
     gridDim()::NamedTuple
 
-Returns the dimensions (in blocks) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-Unlike the `*Idx` intrinsics, `gridDim` returns the same value as its C/C++ extension counterpart.
+    Returns the dimensions (in blocks) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    Unlike the `*Idx` intrinsics, `gridDim` returns the same value as its C/C++ extension counterpart.
 """ gridDim
 @inline gridDim() = (x=gridDim_x(), y=gridDim_y(), z=gridDim_z())
 
 @doc """
     blockIdxInCluster()::NamedTuple
 
-Returns the block index within the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based.
+    Returns the block index within the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based.
 """ blockIdxInCluster
 @inline blockIdxInCluster() = (x=blockIdxInCluster_x(), y=blockIdxInCluster_y(), z=blockIdxInCluster_z())
 
 @doc """
     clusterDim()::NamedTuple
 
-Returns the dimensions (in blocks) of the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
+    Returns the dimensions (in blocks) of the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
 """ clusterDim
 @inline clusterDim() = (x=clusterDim_x(), y=clusterDim_y(), z=clusterDim_z())
 
 @doc """
     clusterIdx()::NamedTuple
 
-Returns the cluster index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based.
+    Returns the cluster index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based.
 """ clusterIdx
 @inline clusterIdx() = (x=clusterIdx_x(), y=clusterIdx_y(), z=clusterIdx_z())
 
 @doc """
     gridClusterDim()::NamedTuple
 
-Returns the dimensions (in clusters) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    Returns the dimensions (in clusters) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
 """ gridClusterDim
 @inline gridClusterDim() = (x=gridClusterDim_x(), y=gridClusterDim_y(), z=gridClusterDim_z())
 
@@ -155,7 +155,7 @@ Returns the dimensions (in clusters) of the grid as a `NamedTuple` with keys `x`
     linearBlockIdxInCluster()::Int32
 
 Returns the linear block index within the cluster.
-These indices are 1-based.
+    These indices are 1-based.
 """ linearBlockIdxInCluster
 @eval @inline $(:linearBlockIdxInCluster)() = _index($(Val(Symbol("cluster.ctarank"))), $(Val(0:max_cluster_length-1))) + 1i32
 
@@ -170,7 +170,7 @@ Returns the linear cluster size (in blocks).
     warpsize()::Int32
 
 Returns the warp size (in threads).
-This corresponds to the `warpSize` built-in variable in the C/C++ extension.
+    This corresponds to the `warpSize` built-in variable in the C/C++ extension.
 """ warpsize
 @inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())
 
@@ -178,7 +178,7 @@ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
     laneid()::Int32
 
 Returns the thread's lane within the warp.
-This ID is 1-based.
+    This ID is 1-based.
 """ laneid
 @inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
 

@github-actions Bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: 69236fb Previous: e260b92 Ratio
latency/precompile 4073991756 ns 4060336008 ns 1.00
latency/ttfp 14097056796 ns 14188556469 ns 0.99
latency/import 3572377951 ns 3555441335 ns 1.00
integration/volumerhs 9444918.5 ns 9440665.5 ns 1.00
integration/byval/slices=1 145945 ns 145820 ns 1.00
integration/byval/slices=3 422989.5 ns 422996 ns 1.00
integration/byval/reference 143993 ns 143940 ns 1.00
integration/byval/slices=2 284659 ns 284595 ns 1.00
integration/cudadevrt 102612 ns 102603 ns 1.00
kernel/indexing 13380 ns 13331 ns 1.00
kernel/indexing_checked 13998.5 ns 14078 ns 0.99
kernel/occupancy 664.1677018633541 ns 692.6139240506329 ns 0.96
kernel/launch 2244.4444444444443 ns 2098.5555555555557 ns 1.07
kernel/rand 14734 ns 15622 ns 0.94
array/reverse/1d 18638 ns 18269 ns 1.02
array/reverse/2dL_inplace 66108 ns 66029 ns 1.00
array/reverse/1dL 69150 ns 68802 ns 1.01
array/reverse/2d 21119 ns 20617 ns 1.02
array/reverse/1d_inplace 8636 ns 10283.666666666666 ns 0.84
array/reverse/2d_inplace 10351 ns 10353 ns 1.00
array/reverse/2dL 73052.5 ns 72617 ns 1.01
array/reverse/1dL_inplace 66004 ns 65907 ns 1.00
array/copy 18680 ns 18749 ns 1.00
array/iteration/findall/int 150479 ns 149387.5 ns 1.01
array/iteration/findall/bool 132426 ns 132253.5 ns 1.00
array/iteration/findfirst/int 84111 ns 83271.5 ns 1.01
array/iteration/findfirst/bool 81929 ns 81441 ns 1.01
array/iteration/scalar 67243 ns 69131 ns 0.97
array/iteration/logical 201055.5 ns 199952 ns 1.01
array/iteration/findmin/1d 90426 ns 86816.5 ns 1.04
array/iteration/findmin/2d 118065.5 ns 117208 ns 1.01
array/reductions/reduce/Int64/1d 43892 ns 43408 ns 1.01
array/reductions/reduce/Int64/dims=1 46449.5 ns 43024 ns 1.08
array/reductions/reduce/Int64/dims=2 60070 ns 59829 ns 1.00
array/reductions/reduce/Int64/dims=1L 87812 ns 87729 ns 1.00
array/reductions/reduce/Int64/dims=2L 84937 ns 84578 ns 1.00
array/reductions/reduce/Float32/1d 35205 ns 35224 ns 1.00
array/reductions/reduce/Float32/dims=1 48235.5 ns 40532 ns 1.19
array/reductions/reduce/Float32/dims=2 57313 ns 56836 ns 1.01
array/reductions/reduce/Float32/dims=1L 52043 ns 51874 ns 1.00
array/reductions/reduce/Float32/dims=2L 69923 ns 69617.5 ns 1.00
array/reductions/mapreduce/Int64/1d 43350 ns 43343 ns 1.00
array/reductions/mapreduce/Int64/dims=1 43301.5 ns 42594 ns 1.02
array/reductions/mapreduce/Int64/dims=2 60015 ns 59634 ns 1.01
array/reductions/mapreduce/Int64/dims=1L 88049.5 ns 87814 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 84985 ns 84815 ns 1.00
array/reductions/mapreduce/Float32/1d 34814 ns 34828 ns 1.00
array/reductions/mapreduce/Float32/dims=1 40509.5 ns 39897 ns 1.02
array/reductions/mapreduce/Float32/dims=2 57186 ns 56752 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 51853 ns 51768.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69785 ns 69310 ns 1.01
array/broadcast 20787 ns 20615 ns 1.01
array/copyto!/gpu_to_gpu 11358 ns 11301 ns 1.01
array/copyto!/cpu_to_gpu 218837 ns 216699 ns 1.01
array/copyto!/gpu_to_cpu 284192 ns 284359.5 ns 1.00
array/accumulate/Int64/1d 118889 ns 118782 ns 1.00
array/accumulate/Int64/dims=1 80273 ns 80255 ns 1.00
array/accumulate/Int64/dims=2 156141 ns 156856 ns 1.00
array/accumulate/Int64/dims=1L 1705695 ns 1704288.5 ns 1.00
array/accumulate/Int64/dims=2L 961718.5 ns 961419 ns 1.00
array/accumulate/Float32/1d 101456.5 ns 101642 ns 1.00
array/accumulate/Float32/dims=1 77050 ns 76595 ns 1.01
array/accumulate/Float32/dims=2 144136 ns 144764 ns 1.00
array/accumulate/Float32/dims=1L 1587086 ns 1593525 ns 1.00
array/accumulate/Float32/dims=2L 660874 ns 660030 ns 1.00
array/construct 1277.5 ns 1287.9 ns 0.99
array/random/randn/Float32 43764 ns 43834 ns 1.00
array/random/randn!/Float32 31585 ns 27591 ns 1.14
array/random/rand!/Int64 33711 ns 27841 ns 1.21
array/random/rand!/Float32 8674.333333333334 ns 8461 ns 1.03
array/random/rand/Int64 37472 ns 30522.5 ns 1.23
array/random/rand/Float32 13421 ns 13025 ns 1.03
array/permutedims/4d 53048.5 ns 52112.5 ns 1.02
array/permutedims/2d 53071 ns 52576 ns 1.01
array/permutedims/3d 53518 ns 52685 ns 1.02
array/sorting/1d 2735142 ns 2744009 ns 1.00
array/sorting/by 3328331.5 ns 3314220 ns 1.00
array/sorting/2d 1072830 ns 1071845 ns 1.00
cuda/synchronization/stream/auto 1066.1 ns 1071.4 ns 1.00
cuda/synchronization/stream/nonblocking 8135 ns 8252.9 ns 0.99
cuda/synchronization/stream/blocking 850.7313432835821 ns 852.530303030303 ns 1.00
cuda/synchronization/context/auto 1197.7 ns 1205.3 ns 0.99
cuda/synchronization/context/nonblocking 7182.5 ns 8066.9 ns 0.89
cuda/synchronization/context/blocking 930.7941176470588 ns 931.074074074074 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@codecov

codecov Bot commented Feb 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.58%. Comparing base (e260b92) to head (69236fb).
⚠️ Report is 13 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3030   +/-   ##
=======================================
  Coverage   90.58%   90.58%           
=======================================
  Files         134      134           
  Lines       11637    11637           
=======================================
  Hits        10541    10541           
  Misses       1096     1096           

☔ View full report in Codecov by Sentry.

Comment thread on src/device/intrinsics/indexing.jl (Outdated)
""" threadIdx
@inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
Member

gridDim returns a dimension/size, not an index.

Contributor Author

Replaced "index" with "dimension" here.

Member

starting dimension doesn't make much sense to me. What else could a size() query return? 0 vs 1-based indexing doesn't apply here.

That said, I'm okay with this if you think this clarifies things.

Member

@christiangnrd commented Mar 7, 2026

Maybe it could be phrased along the lines of:

Unlike the `*Idx` intrinsics, `gridDim` returns the same value as its C/C++ extension counterpart.

I do think this should be mentioned in some form though. The indexing intrinsics being offset while the dim intrinsics are not makes sense when you think about it, but I've also gotten confused by this, and not everyone will think/know to check the source code to confirm.

Either way, the same edits `gridDim` receives should also be mirrored to `blockDim`.

Contributor Author

Done.
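The convention settled on in this thread can be summarized in code. This is a sketch for illustration only (the kernel name `global_index_demo!` is made up); the C/C++ expressions in the comments are shown purely for comparison.

```julia
using CUDA

# Sketch: global 1-based linear index inside a kernel, next to the
# equivalent 0-based CUDA C/C++ expression.
function global_index_demo!(out)
    # C/C++ (0-based):  int i = blockIdx.x * blockDim.x + threadIdx.x;
    # CUDA.jl (1-based):
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # blockDim()/gridDim() are plain sizes, so they return the same
    # values as their C/C++ counterparts; use them for a grid-stride loop.
    stride = gridDim().x * blockDim().x
    while i <= length(out)
        @inbounds out[i] = i
        i += stride
    end
    return
end
```

So only the `*Idx` intrinsics carry the 1-based offset; the `*Dim` intrinsics are sizes and need no adjustment, which is exactly the distinction the docstring edits above spell out.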

@giordano
Contributor Author

Bump? I also expanded the docstrings introduced in #3017.

@giordano
Contributor Author

Bump. This PR keeps running into conflicts with other PRs which are merged in the meantime... 🫠

@maleadt
Member

maleadt commented Apr 9, 2026

Sorry, forgot about this. Thanks!

@maleadt maleadt merged commit 5f45772 into JuliaGPU:master Apr 9, 2026
2 checks passed
@giordano deleted the mg/docs branch April 9, 2026 11:38