MatMul

High-performance BLIS-style matrix multiplication using cache-aware blocking, packed operands, and AVX2 microkernels, designed for low-level performance engineering and hardware-accelerated workloads.

Tags

  • Hard Hack

Categories

  • Hard Hack: RISC-V Edition

Description

High-Performance BLIS-Style Matrix Multiplication for Constrained Workloads

Project Summary

This project implements a high-performance matrix multiplication (GEMM) engine in Rust, designed for low-level performance engineering in constrained execution environments. The system provides several GEMM variants, from a naive baseline through tiled, vectorized, parallel, and BLIS-style implementations to an experimental GPU backend, together with a benchmarking harness for empirical evaluation.

The primary contribution is a BLIS-style GEMM implementation using cache-aware blocking, packed operands, and AVX2 SIMD microkernels, achieving significant speedups over baseline approaches while maintaining correctness and reproducibility.

Repository:

https://github.com/moe-a-m/matmul

Motivation

Matrix multiplication is a foundational kernel in scientific computing, machine learning, and cryptographic workloads. In constrained or decentralized environments, vendor-optimized libraries may not be available, making manual performance engineering essential.

This project explores how classical HPC techniques such as blocking, packing, vectorization, and parallelism can be applied in a modern Rust codebase to approach hardware limits while preserving portability and correctness.

Implementation Overview

The codebase provides multiple GEMM variants for comparison:

  • Naive triple-loop baseline

  • Cache-tiled implementation

  • SIMD-vectorized kernel

  • Multi-threaded parallel version

  • BLIS-style implementation with packed panels and microkernels

  • Experimental GPU/XLA backend

The optimized path follows the BLIS five-loop structure, with cache-aware blocking parameters (MC, NC, KC), thread-local packing of matrix panels, and a hand-written AVX2 6×16 FP32 microkernel.
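
For illustration, the sketch below shows this loop structure in simplified form. The blocking constants use the tuned values reported later in this write-up, while the packing steps and the AVX2 microkernel are reduced to comments and a plain scalar update, so this is a sketch of the technique rather than the repository's actual kernel.

// Simplified BLIS-style GEMM sketch: C (m x n) += A (m x k) * B (k x n), all row-major.
// The blocking constants mirror the tuned values reported below; packing and the
// AVX2 microkernel are replaced here by comments and a plain scalar update.
const MC: usize = 192;
const KC: usize = 256;
const NC: usize = 256;

fn gemm_blocked(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    // Loop 5: partition B and C into column panels of width NC.
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        // Loop 4: partition the shared k dimension into slabs of depth KC.
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            // The real kernel packs B[pc..pc+kc, jc..jc+nc] into a contiguous buffer here.
            // Loop 3: partition A and C into row panels of height MC.
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                // The real kernel packs A[ic..ic+mc, pc..pc+kc] per thread here.
                // Loops 2 and 1 plus the microkernel, shown as a plain scalar update:
                for i in ic..ic + mc {
                    for p in pc..pc + kc {
                        let aip = a[i * k + p];
                        for j in jc..jc + nc {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}

Packing the A and B panels into contiguous buffers is what lets the innermost loops stream both operands at unit stride out of L1/L2 instead of striding through the original matrices.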

Performance Results

Benchmark configuration:

  • Matrix size: 1024 × 1024 × 1024

  • Precision: FP32

  • Validation: deterministic seed, tolerance 1e-5

Variant    | GFLOPs | Speedup
-----------|--------|--------
Naive      | ~0.68  | 1×
Tiled      | ~1.1   | ~1.6×
Vectorized | ~2.9   | ~4.3×
Parallel   | ~3.6   | ~5.3×
BLIS Final | ~4.16  | ~6.1×

The BLIS-style implementation achieves over six times the performance of the naive baseline while remaining numerically correct within the 1e-5 validation tolerance.
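
The GFLOPs figures are obtained in the usual way, as the 2·M·N·K floating-point operations of a GEMM divided by wall-clock time. A minimal sketch of such a measurement (the function name is illustrative, not the harness's actual API):

use std::time::Instant;

// Runs one GEMM invocation and reports throughput in GFLOP/s using the
// standard 2*M*N*K operation count.
fn measure_gflops(m: usize, n: usize, k: usize, mut gemm: impl FnMut()) -> f64 {
    let start = Instant::now();
    gemm();
    let secs = start.elapsed().as_secs_f64();
    (2.0 * m as f64 * n as f64 * k as f64) / secs / 1e9
}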

Roofline Analysis

A roofline model was generated using measured memory bandwidth (~23.2 GB/s) and a conservative peak compute estimate. The optimized kernel operates at high arithmetic intensity (~180 FLOPs/Byte), placing it firmly in the compute-bound region.
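
For context, the ideal arithmetic intensity of a 1024³ FP32 GEMM, assuming each of A, B, and C moves through DRAM exactly once, works out to

AI = 2·M·N·K / (4·(M·K + K·N + M·N)) = (2 · 1024³) / (12 · 1024²) ≈ 171 FLOPs/byte

which is of the same order as the ~180 FLOPs/Byte figure above. At ~23.2 GB/s of measured bandwidth, the bandwidth ceiling implied by that intensity lies far above the achieved ~4.16 GFLOPs, so the compute-bound classification follows directly.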

Measured performance (~4.16 GFLOPs) indicates that memory bandwidth is not the primary bottleneck; instead, instruction-level efficiency and microkernel utilization dominate performance limits.

Auto-Tuning of Blocking Parameters

Blocking parameters (MC, NC, KC) were empirically tuned to match the cache hierarchy of the target system. Several configurations were benchmarked, and the best result was achieved with:

  • MC = 192

  • KC = 256

  • NC = 256

This configuration balances L1 and L2 cache residency with packing overhead, demonstrating the importance of empirical tuning over fixed heuristics.
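
A minimal sketch of that kind of empirical sweep is shown below; the candidate list and the GEMM entry point taking runtime blocking parameters are illustrative rather than the repository's actual tuner.

use std::time::Instant;

// Times each candidate (MC, KC, NC) triple with a GEMM entry point that accepts
// the blocking parameters at runtime, and returns the fastest configuration.
// A real tuner would warm up the caches and average several repetitions.
fn tune_blocking(gemm_with_blocking: impl Fn(usize, usize, usize)) -> (usize, usize, usize) {
    let candidates = [
        (128, 256, 256),
        (192, 256, 256), // best configuration found in this project
        (256, 128, 512),
        (256, 256, 256),
    ];
    let mut best = (candidates[0], f64::INFINITY);
    for &(mc, kc, nc) in &candidates {
        let start = Instant::now();
        gemm_with_blocking(mc, kc, nc);
        let elapsed = start.elapsed().as_secs_f64();
        if elapsed < best.1 {
            best = ((mc, kc, nc), elapsed);
        }
    }
    best.0
}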

Optimization Breakdown

Performance improvements were achieved incrementally through:

  • Loop tiling to improve cache locality

  • SIMD vectorization using AVX2 and FMA

  • Thread-level parallelism with OpenMP

  • BLIS-style packing and microkernels

Each optimization contributed measurable gains, with the final BLIS kernel providing the largest improvement.
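
To make the vectorization step concrete, the sketch below shows the style of AVX2/FMA update used inside a GEMM microkernel: a single broadcast element of the packed A panel is multiplied against a 16-element row of the packed B panel and accumulated into C. It is a simplified 1×16 stand-in, not the project's 6×16 kernel.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// One rank-1 update of a 1 x 16 slice of C: c[0..16] += a_elem * b[0..16].
// Safety: the caller must verify AVX2 and FMA support at runtime and pass
// slices holding at least 16 elements.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn fma_update_1x16(a_elem: f32, b: &[f32], c: &mut [f32]) {
    let a = _mm256_set1_ps(a_elem);               // broadcast the A element
    let b0 = _mm256_loadu_ps(b.as_ptr());         // B[0..8]
    let b1 = _mm256_loadu_ps(b.as_ptr().add(8));  // B[8..16]
    let c0 = _mm256_fmadd_ps(a, b0, _mm256_loadu_ps(c.as_ptr()));        // C[0..8]  += a * B[0..8]
    let c1 = _mm256_fmadd_ps(a, b1, _mm256_loadu_ps(c.as_ptr().add(8))); // C[8..16] += a * B[8..16]
    _mm256_storeu_ps(c.as_mut_ptr(), c0);
    _mm256_storeu_ps(c.as_mut_ptr().add(8), c1);
}

In a full 6×16 FP32 microkernel, the C tile occupies twelve YMM accumulator registers that stay resident across the entire KC-depth loop, which is where most of the register-reuse benefit comes from.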

Performance Analysis

The final implementation uses BLIS-style cache blocking and packed operands to maximize data reuse and arithmetic intensity. Roofline analysis shows the kernel is compute-bound, achieving approximately 4.16 GFLOPs on a 1024³ FP32 workload. Performance is primarily limited by instruction throughput and register utilization rather than memory bandwidth. These results demonstrate that classical HPC techniques remain highly effective for workloads in constrained and decentralized execution environments.

Performance Analysis with GPU Consideration

The project supports both CPU and GPU execution paths to evaluate trade-offs in heterogeneous environments. The optimized CPU implementation uses BLIS-style blocking, packing, SIMD vectorization, and multi-threaded parallelism to deliver predictable and reproducible performance with minimal runtime overhead. This makes the CPU backend well suited for latency-sensitive workloads, moderate problem sizes, and constrained or decentralized environments where accelerator availability or data movement costs are limiting factors.

The GPU backend provides significantly higher peak throughput due to massive hardware parallelism. However, effective performance depends strongly on workload size and execution context. Kernel launch latency, host-device synchronization, and memory transfer overheads reduce utilization for moderate matrix sizes such as 1024³. As a result, GPUs are most advantageous for large, batch-oriented workloads where computation dominates overhead.

In practice, the choice between CPU and GPU execution should be guided by workload scale, data locality, and system constraints. CPUs offer low overhead and strong determinism, while GPUs excel at sustained high throughput when amortization conditions are met. This project demonstrates that both approaches are complementary, and that informed backend selection is essential for efficient execution.

Backend | Implementation       | GFLOPs or Relative Throughput    | Speedup vs Naive CPU | Notes
--------|----------------------|----------------------------------|----------------------|---------------------------------------
CPU     | Naive                | ~0.68 GFLOPs                     | 1×                   | Baseline triple loop
CPU     | Tiled                | ~1.1 GFLOPs                      | ~1.6×                | Improved cache locality
CPU     | Vectorized           | ~2.9 GFLOPs                      | ~4.3×                | AVX2 SIMD utilization
CPU     | Parallel             | ~3.6 GFLOPs                      | ~5.3×                | Multi-threaded execution
CPU     | BLIS-style optimized | ~4.16 GFLOPs                     | ~6.1×                | Packed panels and microkernel
GPU     | XLA-based backend    | Effective throughput much higher | ~300× to ~380×       | Includes launch and transfer overhead

Backend Selection Guideline

Backend selection should be guided by workload size, execution constraints, and system availability. The optimized CPU backend provides predictable performance with low overhead and strong determinism, making it suitable for moderate matrix sizes, latency-sensitive tasks, and environments where accelerator access is limited or non-deterministic.

The GPU backend is best suited for large-scale, throughput-oriented workloads where kernel launch latency and data transfer overheads can be amortized across substantial computation. When such conditions are met, GPUs deliver significantly higher effective throughput than CPU implementations.

This project demonstrates that an informed backend choice is essential for efficient execution and that CPU and GPU approaches serve complementary roles rather than acting as direct replacements.

GPU Execution in Decentralized and Constrained Environments

In decentralized or constrained execution environments, GPU acceleration introduces additional considerations beyond raw throughput. While GPUs offer substantial parallelism, their effective performance depends on reliable access to accelerator hardware, predictable kernel launch latency, and efficient data movement between host and device.

For moderate problem sizes such as 1024³ matrices, overheads from runtime abstraction, memory transfers, and synchronization reduce utilization of peak GPU capability. These factors are particularly relevant in decentralized settings, where execution may occur across heterogeneous nodes with varying accelerator availability and bandwidth constraints.

As a result, optimized CPU execution remains a viable and often preferable option in such environments. The BLIS-style CPU kernel provides consistent performance without relying on specialized hardware, while the GPU backend serves as a high-throughput option when the execution context and workload scale permit effective amortization of overheads.

Reproducibility

All experiments are reproducible using the provided benchmarking harness and JSON workload specifications.

Example command:

cargo run --release --features gpu,blis,parallel,vectorized -- --workload tests/test_workload.json

Recommended build flags:

RUSTFLAGS="-C target-cpu=native -C codegen-units=1" cargo build --release

Correctness is verified via output hashing and maximum error checks.
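
A minimal sketch of the maximum-error half of that check, comparing the optimized output against a reference result under the 1e-5 tolerance listed above (function names are illustrative):

// Largest absolute element-wise difference between an optimized result and a
// reference (e.g. naive) result.
fn max_abs_error(result: &[f32], reference: &[f32]) -> f32 {
    result
        .iter()
        .zip(reference)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

fn within_tolerance(result: &[f32], reference: &[f32]) -> bool {
    max_abs_error(result, reference) <= 1e-5
}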

Takeaways

This project demonstrates that careful low-level performance engineering, combining blocking, packing, vectorization, and parallelism, can deliver substantial gains even without vendor-specific libraries. The results highlight the continued relevance of HPC principles in modern constrained computing environments.