Out-of-Order RISC-V Processor

Built a custom out-of-order 5-stage RISC-V processor using Tomasulo’s Algorithm, featuring dynamic scheduling, speculative execution, and dependency resolution. The design includes reservation stations, a reorder buffer (ROB), early branch recovery (EBR), a load/store queue (LSQ), a cache arbiter, and a common data bus (CDB). Its optimizations target control and memory stalls, the two bottlenecks that limited IPC most in practice.

The processor is a 5-stage out-of-order RISC-V core based on Tomasulo’s Algorithm, with four 4-entry reservation stations (ALU, CMP, MUL/DIV, Load/Store) and an 8-entry reorder buffer (ROB) that commits instructions in order. Four Common Data Buses (CDBs) broadcast results in parallel. The front end uses a FIFO instruction queue, and the 4-way set-associative I-Cache and D-Cache are accessed through a custom cache adapter. Functional units include pipelined ALUs and 6-cycle DW_div_pipe IPs for multiply/divide operations. Together, these structures let independent instructions execute out of order while results still retire in program order.
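The interaction between out-of-order write-back on the CDB and in-order commit from the ROB can be illustrated with a small behavioral model. This is a minimal Python sketch, not the project's RTL: the class and method names (`ReorderBuffer`, `allocate`, `broadcast`, `commit`) are hypothetical, tags grow without wrapping for simplicity, and only the 8-entry size is taken from the description above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

ROB_SIZE = 8  # matches the 8-entry ROB described above


@dataclass
class RobEntry:
    tag: int                     # tag matched against CDB broadcasts (hypothetical field)
    dest: str                    # architectural destination register
    value: Optional[int] = None  # filled in when the result is broadcast
    done: bool = False


class ReorderBuffer:
    """Toy ROB: allocate in program order, complete in any order, commit in order."""

    def __init__(self, size=ROB_SIZE):
        self.size = size
        self.entries = deque()
        self.next_tag = 0  # simplification: real hardware would wrap the tag

    def allocate(self, dest):
        """Reserve the next slot in program order; return None when full (stall)."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(tag=self.next_tag, dest=dest)
        self.next_tag += 1
        self.entries.append(entry)
        return entry.tag

    def broadcast(self, tag, value):
        """CDB write-back: mark the matching entry complete, in any order."""
        for entry in self.entries:
            if entry.tag == tag:
                entry.value, entry.done = value, True

    def commit(self, regfile):
        """Retire completed entries from the head only, preserving program order."""
        retired = []
        while self.entries and self.entries[0].done:
            entry = self.entries.popleft()
            regfile[entry.dest] = entry.value
            retired.append(entry.tag)
        return retired
```

In this model, a younger instruction whose result arrives first stays in the buffer until every older entry has completed, which is what keeps architectural state precise even though execution is out of order.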

To enhance performance, the processor integrates several advanced features: Early Branch Recovery (EBR), Age-Ordered Issue Scheduling, a Post-Commit Store Buffer (PCSB), and a GShare Branch Predictor combined with a Branch Target Buffer (BTB). EBR flushes the pipeline early on a branch misprediction, improving IPC by 20%. Age-ordered scheduling issues the oldest ready instruction first, which keeps issue fair and reduces reservation station stalls. The PCSB decouples committed stores from the memory system and enables store-to-load forwarding, resulting in an 8.6% IPC gain. GShare with the BTB improves branch prediction accuracy to up to 75% on control-heavy workloads. Collectively, these optimizations reduce control and memory stalls while maintaining architectural correctness.
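As an illustration of the GShare idea mentioned above, the following Python sketch models the textbook scheme: the branch PC is XORed with a global history register to index a table of 2-bit saturating counters. It is a sketch under stated assumptions, not the project's implementation: the table size (`index_bits=10`), the initial counter state, and the class name `GShare` are all hypothetical, and the BTB (which supplies the target address) is not modeled.

```python
class GShare:
    """Textbook gshare direction predictor with 2-bit saturating counters."""

    def __init__(self, index_bits=10):  # table size is an assumed parameter
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # start at "weakly not-taken"

    def _index(self, pc):
        # Drop the two low PC bits (4-byte aligned instructions), then XOR with history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        """Return True when the indexed counter predicts taken (value 2 or 3)."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter with the resolved outcome, then shift it into history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history participates in the index, a branch that alternates between taken and not-taken maps to two different counters and becomes predictable after a short warm-up, which a PC-indexed table of counters alone would not achieve.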

The processor was developed and verified in SystemVerilog, with simulation in Synopsys VCS and Verilator and signal-level debugging in Verdi. Testing used a suite of custom and provided benchmarks, including CoreMark, AES-SHA, FFT, mergesort, and compression, with correctness validated against Spike logs. The advanced features (EBR, GShare, and PCSB) increased area and complexity, adding ~20% area and ~4% power overhead, but the improved IPC and reduced pipeline stalls justified the cost. Python scripts supported performance profiling by analyzing instruction frequency, cache behavior, and branch patterns, which helped guide design choices throughout development.
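The instruction-frequency analysis mentioned above could look something like the sketch below. It is not the project's actual script: the function name `mnemonic_histogram` is hypothetical, and the regular expression assumes trace lines of the form `core   0: 0x<pc> (0x<encoding>) <mnemonic> <operands>`, which would need adjusting to whatever log format is actually produced.

```python
import re
from collections import Counter

# Assumed trace line shape: "core   0: 0x80000000 (0x00000297) auipc   t0, 0x0"
LINE_RE = re.compile(
    r"core\s+\d+:\s+0x[0-9a-fA-F]+\s+\(0x[0-9a-fA-F]+\)\s+(\S+)"
)


def mnemonic_histogram(lines):
    """Count how often each instruction mnemonic appears in an iterable of trace lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines that do not look like instruction records
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object works because the function only iterates over lines; `mnemonic_histogram(f).most_common(10)` would then list the ten most frequent mnemonics, a quick way to see whether a benchmark is dominated by loads, branches, or arithmetic.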

This project covers the design and implementation of a fully functional out-of-order RISC-V core, from dynamic scheduling and dependency resolution to optimizations such as early branch recovery and store-to-load forwarding. The design balances performance, complexity, and correctness, and it draws on practical skills in microarchitecture, RTL design, verification, and performance analysis. It also illustrates the trade-offs CPU architects navigate when pushing for higher efficiency and throughput in real-world systems.