When developing CUDA kernels, one of the fundamental challenges developers face is ensuring threads don’t access memory outside the bounds of their data structures. While C++ offers elegant solutions through virtual inheritance and polymorphism, CUDA’s device code has significant limitations that force us to find creative workarounds. Let’s explore a practical example that demonstrates both the problem and a clever solution. The complete implementation can be found in the cuda-learn repository, with a working example in the mat_gen.cu file.
The Virtual Table Problem in CUDA
In standard C++, we might naturally reach for virtual inheritance to create a clean boundary checking interface:
// This WON'T work in CUDA device code
class BoundaryChecker {
public:
    virtual __device__ bool check() const = 0;
    virtual __device__ dim3 pos() const = 0;
};

class LinearChecker : public BoundaryChecker {
    // Implementation...
};
However, classic virtual dispatch translates poorly to CUDA. Device code does support virtual functions, but with a hard restriction: an object with virtual functions must be constructed in device code, and such an object may not be passed as a kernel argument. The reason is the hidden vtable pointer stored inside every polymorphic object: it refers to an address in the address space where the object was constructed, so copying the object from host to device leaves it pointing at host memory, and calling its virtual functions on the GPU is undefined behavior. On top of that, indirect calls through function pointers defeat inlining and can hurt performance across thousands of threads, which is why device code usually avoids runtime dispatch altogether.
A Strategy Pattern Solution
The code in boundaries.h presents an elegant workaround using a manual implementation of the strategy pattern. Instead of relying on virtual inheritance, it uses three concrete strategy classes and a unified interface that switches between them at runtime.
The Strategy Classes
The implementation provides three boundary checking strategies:
LinearCheckStrategy – For one-dimensional arrays:
class LinearCheckStrategy {
public:
    // Built on the host, queried on the device.
    __host__ __device__ explicit LinearCheckStrategy(unsigned int size)
        : size_(size) {}

    __device__ bool check() const {
        return pos().x < size_;
    }
    __device__ dim3 pos() const {
        return dim3{blockIdx.x * blockDim.x + threadIdx.x, 0, 0};
    }

private:
    unsigned int size_;  // dim3 components are unsigned, so compare unsigned
};
RectangularCheckStrategy – For 2D matrices:
class RectangularCheckStrategy {
public:
    __host__ __device__ RectangularCheckStrategy(unsigned int rows,
                                                 unsigned int cols)
        : rows_(rows), cols_(cols) {}

    __device__ bool check() const {
        auto p = pos();
        return p.y < rows_ && p.x < cols_;
    }
    __device__ dim3 pos() const {
        return dim3{blockIdx.x * blockDim.x + threadIdx.x,
                    blockIdx.y * blockDim.y + threadIdx.y, 0};
    }

private:
    unsigned int rows_, cols_;  // spelled out: uint is not a standard type
};
CubeCheckStrategy – For 3D volumes:
class CubeCheckStrategy {
public:
    __host__ __device__ explicit CubeCheckStrategy(dim3 size) : size_(size) {}

    __device__ bool check() const {
        auto p = pos();
        return p.x < size_.x && p.y < size_.y && p.z < size_.z;
    }
    __device__ dim3 pos() const {
        return dim3{blockIdx.x * blockDim.x + threadIdx.x,
                    blockIdx.y * blockDim.y + threadIdx.y,
                    blockIdx.z * blockDim.z + threadIdx.z};
    }

private:
    dim3 size_;
};
The Unified Interface
The SizeCheck class acts as a type-erased wrapper that can work with any of the three strategies:
// The tag used to select a strategy (defined alongside the classes in boundaries.h)
enum class BoundaryType { Linear, Rectangular, Cube };

class SizeCheck {
public:
    __device__ explicit SizeCheck(const void *strategy, BoundaryType type) {
        switch (type) {
        case BoundaryType::Linear:
            linStrategy_ = static_cast<const LinearCheckStrategy *>(strategy);
            break;
        case BoundaryType::Rectangular:
            rectStrategy_ = static_cast<const RectangularCheckStrategy *>(strategy);
            break;
        case BoundaryType::Cube:
            cubeStrategy_ = static_cast<const CubeCheckStrategy *>(strategy);
            break;
        }
    }

    __device__ bool operator()() const {
        if (linStrategy_) {
            return linStrategy_->check();
        } else if (rectStrategy_) {
            return rectStrategy_->check();
        } else if (cubeStrategy_) {
            return cubeStrategy_->check();
        }
        return false;
    }

private:
    // Exactly one of these ends up non-null; initializing all three to
    // nullptr is what makes the dispatch in operator() safe.
    const LinearCheckStrategy *linStrategy_ = nullptr;
    const RectangularCheckStrategy *rectStrategy_ = nullptr;
    const CubeCheckStrategy *cubeStrategy_ = nullptr;
};
How to Use This Pattern
Here’s how you would typically use this boundary checking system in a CUDA kernel:
__global__ void myKernel(float* data, LinearCheckStrategy checker) {
    SizeCheck sizeCheck(&checker, BoundaryType::Linear);
    if (sizeCheck()) {
        // idx() is the linear-index helper SizeCheck exposes in the repository
        int idx = sizeCheck.idx();
        // Safe to access data[idx]
        data[idx] = /* some computation */;
    }
}
// Host code
void launchKernel(float* d_data, int size) {
    LinearCheckStrategy checker(size);
    dim3 blockSize(256);
    dim3 gridSize((size + blockSize.x - 1) / blockSize.x);  // round up
    myKernel<<<gridSize, blockSize>>>(d_data, checker);
}
Real-World Usage: Matrix Generation Example
The mat_gen.cu file in the repository demonstrates a practical application of this boundary checking system for generating random matrices. The implementation creates a CUDA kernel that fills a 2D matrix with normally distributed random values:
// Requires #include <curand_kernel.h> for curandState and curand_normal
__global__
void fillMatrixKernel(float* mat, void* strategy, float mean, float stddev, unsigned long seed) {
    SizeCheck checker { strategy, BoundaryType::Rectangular };
    if (checker()) {
        auto idx = checker.idx();  // linear-index helper from SizeCheck
        curandState state;
        curand_init(seed, idx, 0, &state);  // idx selects this thread's subsequence
        mat[idx] = curand_normal(&state) * stddev + mean;
    }
}
The key aspects of this real-world usage include:
Strategy Allocation: The RectangularCheckStrategy is allocated both on host and device, with explicit memory management:
RectangularCheckStrategy strategy(m, n);
RectangularCheckStrategy* d_strategy;
cudaMalloc(&d_strategy, sizeof(RectangularCheckStrategy));
cudaMemcpy(d_strategy, &strategy, sizeof(RectangularCheckStrategy), cudaMemcpyHostToDevice);
// ... launch the kernel, then release the device copy with cudaFree(d_strategy)
Grid Configuration: The 2D grid is configured to match the matrix dimensions, with careful attention to ensure all matrix elements are covered by the thread grid.
Thread Safety: Each thread gets its own random number generator state. curand_init is called with a shared seed and the thread's linear index as the subsequence number, so every thread draws from an independent, non-overlapping stream of the same generator, giving high-quality random numbers across the entire matrix.
This example shows how the boundary checking pattern integrates seamlessly into real CUDA applications, providing both safety and performance.
The Benefits and Trade-offs
This approach offers several advantages:
Performance: No virtual function call overhead – the compiler can often inline the strategy calls completely.
Flexibility: Easy to add new boundary checking strategies without changing existing code.
Type Safety: Compile-time checking ensures you’re using the right strategy for your data structure.
However, there are trade-offs:
Code Complexity: More verbose than simple virtual inheritance would be.
Runtime Overhead: The switch statement and null pointer checks add some overhead, though it’s typically minimal compared to memory access costs.
Memory Usage: Each SizeCheck instance stores three pointers, even though only one is used.
Why This Matters
This pattern demonstrates a broader principle in CUDA development: sometimes the most elegant C++ solutions don’t translate directly to GPU code. Understanding these limitations and developing workarounds is crucial for writing efficient CUDA applications.
The boundary checking problem is ubiquitous in GPU computing – nearly every kernel needs to ensure threads don’t access invalid memory. Having a reusable, efficient solution like this can significantly improve both code quality and development velocity.
Conclusion
While CUDA’s lack of virtual table support might seem limiting, it forces us to think more carefully about performance and often leads to more efficient solutions. The boundary checking implementation shown here is a perfect example of turning a limitation into an opportunity for better design.
This pattern can be extended beyond boundary checking to any situation where you need polymorphic behavior in CUDA device code. By understanding and embracing these constraints, we can write GPU code that’s both performant and maintainable.