-
YPP04.1
Design of Energy-Efficient, High-Throughput Reconfigurable Systems through Cross-Layer Approximation and In-Network Computing
18:30
Zahra Ebrahimi1,
Akash Kumar2
1 Technische Universität Dresden
; 2 Ruhr University Bochum
As technology scaling plateaus, sustaining high performance and energy-efficiency requires alternative design strategies. Modern signal-processing and machine-learning workloads across the edge-to-cloud continuum exhibit a natural tolerance to noise and computational imprecision, enabling the relaxation of strict accuracy constraints. This trend has fueled the rise of cross-layer approximation, where optimizations are applied across circuit, architecture, and application layers to extract additional performance. At the same time, the increasing scale and data intensity of emerging applications expose the limitations of traditional edge-cloud computing, particularly in meeting real-time, throughput, and energy-efficiency requirements. These challenges have motivated the shift toward in-network computing (INC), which exploits programmable network devices to process data in transit.
Despite this need, existing approximation and INC techniques fall short: prior works often overlook INC constraints, target ASICs rather than reconfigurable fabrics, focus primarily on adders and multipliers while neglecting costly division, operate in non-pipelined SISD styles that limit throughput, omit architectural-level optimizations, and lack general frameworks for tuning approximation knobs in multi-kernel applications.
In response, this thesis introduces a family of cross-layer approximation architectures tailored for reconfigurable FPGA- and CGRA-based fabrics. New approximate multipliers and dividers featuring fine-grained pipelining and SIMD execution models are proposed, along with a hybrid SIMD multiplier-divider that supports runtime functional and precision versatility without reconfiguration. These designs are incorporated into heterogeneous SIMD/MIMD CGRAs, establishing a unified cross-layer hierarchy.
Building on this foundation, a cross-layer methodology is developed that applies kernel-wise sensitivity analysis to capture performance-QoR trade-offs, followed by a greedy heuristic that tunes approximation knobs to maximize performance under user-defined QoR constraints.
Finally, the approach is extended to distributed and INC environments, enabling in-network acceleration of multi-kernel applications. By restructuring computations for programmable switches, the method reduces resource usage and communication overhead while improving real-time responsiveness, aligning with key 5G/6G performance targets.
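As a rough illustration of the knob-tuning step, the sketch below implements a generic greedy loop that raises one per-kernel approximation level at a time as long as the quality-of-result (QoR) budget is still met; the callback names and knob encoding are illustrative placeholders, not the thesis interface.

    # Greedy tuning of per-kernel approximation knobs under a QoR constraint.
    # evaluate_qor(cfg) and evaluate_perf(cfg) are assumed application-provided
    # callbacks; higher levels mean more aggressive approximation.

    def greedy_tune(kernels, max_level, qor_limit, evaluate_qor, evaluate_perf):
        cfg = {k: 0 for k in kernels}                  # start from the exact configuration
        improved = True
        while improved:
            improved = False
            best_gain, best_kernel = 0.0, None
            base_perf = evaluate_perf(cfg)
            for k in kernels:
                if cfg[k] == max_level:
                    continue
                trial = dict(cfg, **{k: cfg[k] + 1})   # raise one knob by one step
                if evaluate_qor(trial) < qor_limit:    # reject if the QoR budget is violated
                    continue
                gain = evaluate_perf(trial) - base_perf
                if gain > best_gain:
                    best_gain, best_kernel = gain, k
            if best_kernel is not None:
                cfg[best_kernel] += 1                  # commit the most profitable knob
                improved = True
        return cfg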
-
YPP04.2
Efficient Compilation for Coarse-Grained Reconfigurable Arrays
18:30
Cristian Tirelli1,
Laura Pozzi1
1 USI Lugano
Coarse Grain Reconfigurable Arrays (CGRAs) are programmable accelerators that offer a compelling balance between flexibility and energy efficiency, making them attractive for compute intensive workloads under tight power and resource constraints. Their effectiveness, however, depends critically on the quality of the mapping, meaning how operations are placed on processing elements and scheduled in time. Existing compilation approaches largely rely on heuristics, which only explore a small portion of the mapping space and can miss high quality solutions, especially as architectures and workloads grow in size.
My thesis work advances CGRA compilation through two formal mapping strategies. The first contribution is a satisfiability based formulation of the mapping problem, which uses a solver to explore the complete mapping space and produce high quality placements, schedules, and routes. The second contribution addresses scalability by separating the temporal and spatial components of the problem, first computing a schedule in time and then constructing a spatial mapping through a graph monomorphism search. Experiments across a range of CGRA sizes show that, together, these methods preserve mapping quality while reducing compilation time by orders of magnitude, making exact and formally grounded CGRA compilation practical for larger architectures and more complex workloads.
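A minimal sketch of the spatial step, assuming the schedule has already been fixed: a backtracking search that assigns each dataflow operation to a distinct processing element such that every data dependency lands on a physical link, i.e., a graph monomorphism. The names and the toy 2x2 mesh are illustrative; the actual tool also models time and routing.

    # Backtracking graph-monomorphism search: every DFG node gets a distinct PE,
    # and every DFG edge must be realizable over a physical CGRA link.

    def find_mapping(dfg_nodes, dfg_edges, pes, cgra_links):
        def ok(partial, node, pe):
            for (u, v) in dfg_edges:
                if u == node and v in partial and (pe, partial[v]) not in cgra_links:
                    return False
                if v == node and u in partial and (partial[u], pe) not in cgra_links:
                    return False
            return True

        def search(partial, remaining):
            if not remaining:
                return partial
            node = remaining[0]
            for pe in pes:
                if pe not in partial.values() and ok(partial, node, pe):
                    result = search({**partial, node: pe}, remaining[1:])
                    if result is not None:
                        return result
            return None                                # backtrack

        return search({}, list(dfg_nodes))

    # Example: map a 3-operation chain a->b->c onto a 2x2 mesh (directed links).
    links = {(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)}
    print(find_mapping(["a", "b", "c"], [("a", "b"), ("b", "c")], [0, 1, 2, 3], links))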
-
YPP04.3
A Study of Performance Optimization Techniques of Digital Integrated Circuit Test Generation System
18:30
Zhiteng Chao1
1 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; CASTEST Co., Ltd.
Electronic Design Automation (EDA) for digital integrated circuits covers major design steps such as logic synthesis, test synthesis, and physical design. A test synthesis EDA tool must support Design For Testability (DFT) and Automatic Test Pattern Generation (ATPG) to ensure comprehensive detection of circuit defects after manufacturing. The ATPG problem is NP-complete, and the growing scale of digital integrated circuit designs presents significant challenges to ATPG systems: the testability of complex circuits decreases, making fault coverage difficult to guarantee; the number of test patterns grows dramatically, leading to high test costs; and test generation time can range from days to weeks, becoming a bottleneck in the iterative design process. Improving the efficiency of test generation while ensuring fault coverage and compressing test data has therefore become a critical research direction for test synthesis EDA technology.
-
YPP04.4
Towards Performance-driven Analog Layout Design
18:30
Peng Xu1,
Bei Yu1
1 The Chinese University of Hong Kong
Although automated analog layout synthesis tools exist, they often struggle to accurately model performance, especially in the presence of layout-induced parasitics and device mismatch.
To overcome this limitation, this work proposes a unified, performance-driven analog place-and-route (PNR) methodology that tightly integrates optimization with machine learning.
The framework first introduces an unsupervised, physics-guided circuit representation learning scheme.
It then introduces a multi-objective analog placement engine with a differentiable performance predictor and a performance-driven routing framework that learns effective routing guidance.
-
YPP04.5
From Theory to Silicon: Mapping Nested Loops to Tightly Coupled Processor Arrays (TCPAs)
18:30
Dominik Walter1,
Jürgen Teich1
1 FAU
The increasing demand for computational performance has driven the development of specialized accelerators.
Massively parallel arrays of tiny processing elements that process and communicate data locally are a natural match for a large class of computationally intensive problems characterized by nested loop programs with affine data dependencies, such as those common in linear algebra, signal processing, and image processing.
This work investigates such a class of processor arrays called Tightly Coupled Processor Arrays (TCPAs).
For their programming, TCPAs use a formal model of loop nests called Piecewise Regular Algorithms (PRAs) that decouples the semantics of algorithms from their low-level implementation.
This allows for an iteration-centric mapping paradigm that exploits parallelism at the level of loop iterations.
We show that this approach achieves up to 19x speedup over an operation-centric coarse-grained reconfigurable array.
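As a toy illustration of the iteration-centric idea (not the PRA notation itself), the following sketch applies a simple space-time mapping to a two-dimensional loop nest, assigning each iteration (i, j) to processor p = i and time step t = j; sizes are illustrative.

    import numpy as np

    A = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows -> 4 PEs
    x = np.array([1.0, 2.0, 3.0])
    acc = np.zeros(4)                              # one local accumulator per PE

    for t in range(A.shape[1]):                    # schedule: global time step t = j
        for p in range(A.shape[0]):                # allocation: row i -> PE p (parallel in hardware)
            acc[p] += A[p, t] * x[t]

    assert np.allclose(acc, A @ x)                 # y = A x computed iteration by iteration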
In addition, we propose a zero-overhead loop control mechanism and a loop-aware memory subsystem that schedules all required data transfers between external and on-chip memories in real time.
These concepts are validated through FPGA and ASIC prototypes, including a 22 nm ALPACA chip delivering 537.6 GFLOPs at 700 MHz.
Overall, these results demonstrate that TCPAs offer an efficient, scalable, and formally sound architecture for accelerating nested loops.
-
YPP04.6
Towards Ultra-reliable Automotive Systems-on-Chip
18:30
Giusy Iaria1,
Paolo Bernardi1
1 Politecnico di Torino
Modern automotive devices must deliver increasingly advanced functionality by incorporating millions of logic gates and scan flip-flops. The growing complexity of these devices is putting a strain on traditional manufacturing testing methodologies, which, although they still succeed in ensuring the required high reliability standards, need new techniques to better deal with this complexity. This work details several proposed methodologies that aim to optimize the testing process across all life stages of an industrial automotive device, starting from Wafer Sort and ending in the In-Field phase. The contributions aim at guaranteeing high reliability despite the increasing complexity of modern devices.
Experiments have been carried out using automotive devices and manufacturing data provided by STMicroelectronics, demonstrating practical applicability in industrial environments.
-
YPP04.7
Hardware Acceleration with Backdoor and Side-Channel Analysis for Lattice-based Cryptography
18:30
Suraj Mandal1,
Debapriya Basu Roy2
1 IIT Kanpur
; 2 Indian Institute of Technology Kanpur
Due to the recent growth in the development of quantum computers, post-quantum cryptographic (PQC) algorithms have become the centre of attention, replacing classical cryptographic algorithms like ECC/RSA, which are vulnerable to quantum attacks. Among the PQC algorithms, CRYSTALS-Kyber (FIPS-203) and CRYSTALS-Dilithium (FIPS-204) are the two standardized algorithms recommended by NIST as a quantum-secure key encapsulation mechanism (KEM) and a quantum-secure digital signature algorithm (DSA), respectively. Both CRYSTALS-Kyber and CRYSTALS-Dilithium are module-lattice (ML) based algorithms that require large polynomial multiplications as well as several hash operations, which make software implementations of these algorithms very slow. In this work, we have designed low-area and low-latency dedicated hardware accelerators for both algorithms. Our proposed hardware accelerators have been tested and implemented on FPGAs. Furthermore, our proposed polynomial multipliers based on the Number Theoretic Transform (NTT) can also be used for homomorphic encryption schemes and other lattice-based algorithms. We have also performed security analysis, including side-channel analysis and backdoor attacks, on both of these algorithms.
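To illustrate the arithmetic these accelerators target, the sketch below multiplies two polynomials in Z_q[X]/(X^n + 1) by schoolbook negacyclic convolution and cross-checks the result against an evaluation-form Number Theoretic Transform. The toy parameters (q = 257, n = 8) are chosen only so that a 2n-th root of unity exists; they are not the Kyber or Dilithium parameters, and the real accelerators use an O(n log n) butterfly NTT.

    q, n = 257, 8

    def negacyclic_schoolbook(a, b):
        c = [0] * n
        for i in range(n):
            for j in range(n):
                s = a[i] * b[j]
                if i + j < n:
                    c[i + j] = (c[i + j] + s) % q
                else:
                    c[i + j - n] = (c[i + j - n] - s) % q   # wrap with X^n = -1
        return c

    # psi: a primitive 2n-th root of unity mod q (psi^n = -1)
    psi = next(w for w in range(2, q)
               if pow(w, n, q) == q - 1 and pow(w, 2 * n, q) == 1)

    def ntt(a):      # evaluation form: a_hat[i] = a(psi^(2i+1))
        return [sum(a[j] * pow(psi, j * (2 * i + 1), q) for j in range(n)) % q
                for i in range(n)]

    def intt(a_hat):  # interpolation back to coefficients (Python 3.8+ modular inverse)
        n_inv = pow(n, -1, q)
        return [(n_inv * sum(a_hat[i] * pow(psi, -j * (2 * i + 1), q)
                             for i in range(n))) % q
                for j in range(n)]

    a = [3, 1, 4, 1, 5, 9, 2, 6]
    b = [2, 7, 1, 8, 2, 8, 1, 8]
    pointwise = [(x * y) % q for x, y in zip(ntt(a), ntt(b))]
    assert intt(pointwise) == negacyclic_schoolbook(a, b)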
-
YPP04.8
Multi-objective Full-Process Design Space Exploration for Chiplet Heterogeneous Integration
18:30
Shixin Chen1
1 The Chinese University of Hong Kong
Chiplet heterogeneous integration has emerged as a pivotal solution to circumvent Moore's Law limitations, enabling flexible scaling, heterogeneous integration, and high component reusability for high-performance computing systems.
However, its design process is plagued by four intertwined challenges that hinder optimal tape-out.
First, the design space expands exponentially with parameters spanning die partitioning, interconnection topology, microarchitecture configuration, and packaging options, leading to inefficient blind exploration.
Second, conflicting multi-objective requirements (performance, power, area, cost, and reliability) demand careful trade-offs, as metrics like power consumption correlate with thermal distribution and area directly impacts manufacturing cost.
Third, the inherent trade-off between simulation speed and accuracy persists, where high-precision simulations are computationally prohibitive, while fast simulations lack sufficient fidelity or only cover partial metrics.
Fourth, cross-layer mismatches between architectural design and physical implementation (e.g., advanced packaging-induced latency, floorplan-induced communication overhead) often render theoretical performance gains unachievable in practice.
Addressing these challenges requires not only isolated optimizations but a holistic framework that integrates application guidance, efficient simulation, multi-objective optimization, and physical implementation—ensuring both exploration efficiency and final design quality.
-
YPP04.9
Near-Data Processing of Structured and Semi-structured NoSQL Data on FPGA-based SoCs
18:30
Tobias Hahn1,
Jürgen Teich1
1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
The exponential growth of diverse data formats has created new challenges for efficient data processing systems. Contemporary NoSQL data formats such as JSON, Apache Avro, and Apache Parquet provide substantial benefits including rapid schema evolution, enhanced query performance, reduced storage footprint, and improved version compatibility, yet remain poorly optimized for traditional processing systems designed primarily for structured relational data. This work summarizes approaches to near-data processing of structured and semi-structured NoSQL documents on FPGA-based Systems-on-Chip (SoCs). A suite of specialized hardware accelerators is presented that efficiently handle the parsing, preprocessing, and transformation of complex NoSQL data formats, demonstrating significant performance improvements over conventional CPU-based solutions. The proposed accelerators are designed for minimal resource utilization while achieving high throughput, making them well-suited for integration into heterogeneous computing architectures such as SmartSSDs and SmartNICs. This enables efficient near-data processing setups that significantly reduce data movement overhead and improve overall system performance for NoSQL workloads.
-
YPP04.10
HW-SW co-design techniques for DT-based Ensemble models on embedded systems
18:30
Alessandro Verosimile1,
Marco D. Santambrogio1
1 Politecnico di Milano
The growing adoption of Artificial Intelligence of Things (AIoT) systems demands embedded architectures capable of executing complex Machine Learning (ML) workloads under strict constraints on latency, throughput, and resource utilization. Decision Tree (DT) Ensembles, such as Random Forests (RFs), remain state-of-the-art for heterogeneous tabular data and offer inherent parallelism suitable for hardware acceleration. However, their efficient deployment on resource-constrained devices requires dedicated co-optimization of both the model and its hardware architecture. This PhD project presents HW–SW co-design methodologies that unify training and architectural development to produce models that are simultaneously accurate and hardware-compliant. Building upon prior memory-centric designs, the proposed flow introduces a novel multi-pipeline design exploiting both vertical and horizontal parallelism. Thanks to optimized resource utilization strategies, heterogeneous-depth training, and automated spatial-architecture acceleration mechanisms, the resulting architecture significantly increases the number of DTs that can be supported on embedded platforms while significantly reducing inference latency, achieving up to a 46.8× improvement over state-of-the-art solutions with negligible accuracy loss. Lastly, the framework has been extended to perform the inference of Oblique Random Forests (ORFs). Thanks to a hardware-aware training procedure and to a novel circulant memory mapping scheme, this approach significantly improves accuracy and latency compared to traditional axis-aligned Random-Forest accelerators.
-
YPP04.11
AI-Driven Topology Synthesis and System-level Optimization of Analog/Mixed-Signal Circuits
18:30
Jiaqi Wang1,
Georges Gielen2
1 KU Leuven
; 2 MICAS-KU Leuven
Analog/Mixed-Signal (AMS) circuit design remains a significant bottleneck in the development of modern electronic systems. While digital design automation has matured significantly, analog design is still largely manual, relying heavily on the intuition of expert designers to navigate vast combinatorial search spaces. As system requirements for bandwidth and power efficiency become more demanding, traditional optimization methods (such as genetic algorithms) struggle with sample efficiency, and manual topology selection becomes increasingly intractable.
My PhD research addresses these challenges by developing a comprehensive Artificial Intelligence (AI) framework for the system-level automation of AMS circuits. The work bridges the gap between optimization and synthesis through two core methodologies:
Graph-Guided Reinforcement Learning (RL): Utilizing Graph Attention Networks (GATs) to optimize circuit parameters and enable knowledge transfer between dissimilar architectures.
Generative AI for Topology Synthesis: Leveraging Large Language Models (LLMs) to automate the structural synthesis of circuits.
-
YPP04.12
High-Level Synthesis of speculative circuits
18:30
Dylan Leothaud1
1 Univ Rennes, IRISA
State-of-the-art High-Level Synthesis (HLS) tools offer an easy way to rapidly prototype and automatically generate hardware from a C or C++ behavioral description. Circuits generated with HLS tools may underperform due to static scheduling, which determines when each operation will be executed but uses only compile-time information and therefore assumes the worst case. To overcome the limitations of static scheduling, we develop SpecHLS, a speculative HLS tool that extends a state-of-the-art HLS tool to insert speculation mechanisms into the generated circuits. My PhD addresses three key questions that SpecHLS left open: (1) how and where to speculate in the generated circuit; (2) how to reduce the area overhead induced by the speculation mechanisms; and (3) how to take resource conflicts into account in the speculative circuits.
-
YPP04.13
Real-Time Emotion Recognition: Deep Physiological Feature Learning and Adaptive Edge Intelligence
18:30
Junjiao Sun1
1 Centro de Electrónica Industrial, Universidad Politecnica de Madrid
Affective computing seeks to equip intelligent systems with the ability to perceive and respond to human emotions. Physiological signals acquired from wearable sensors—such as BVP, EDA/GSR, and SKT—offer a non-intrusive and privacy-preserving source of information for real-time affect monitoring. However, accurate emotion recognition from physiological data remains challenging due to cross-user variability, limited labeled datasets, and the computational and privacy constraints inherent to edge devices.
This doctoral research develops a deep learning framework for non-intrusive physiological emotion recognition, with emphasis on structured feature representations, temporal modeling, personalization, and resource-efficient deployment. First, multi-channel physiological signals are transformed into two-dimensional feature maps, enabling the use of image-based neural architectures that better capture spatial and inter-signal relationships than traditional handcrafted approaches. Building on this representation, a hybrid CNN–LSTM model is proposed to jointly learn spatial descriptors and temporal dynamics, improving robustness and accuracy under real-world constraints.
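A minimal PyTorch sketch of such a hybrid CNN-LSTM, assuming illustrative shapes (windows of 2-D physiological feature maps in, emotion-class logits out); the layer sizes are placeholders rather than the thesis architecture.

    import torch
    import torch.nn as nn

    class CnnLstm(nn.Module):
        def __init__(self, n_classes=3, hidden=64):
            super().__init__()
            self.cnn = nn.Sequential(                   # spatial / inter-signal features
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.lstm = nn.LSTM(input_size=32 * 4 * 4, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):                           # x: (batch, time, 1, H, W)
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1))           # fold time into batch for the CNN
            feats = feats.flatten(1).view(b, t, -1)
            out, _ = self.lstm(feats)                   # temporal dynamics
            return self.head(out[:, -1])                # classify from the last time step

    logits = CnnLstm()(torch.randn(8, 10, 1, 32, 32))   # -> shape (8, 3)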
To address cold-start personalization—where new users provide no labeled data yet exhibit unique physiological patterns—this work introduces CLEAR, an adaptive clustering and fine-tuning method that enables immediate personalized inference with minimal supervision while remaining fully compatible with embedded hardware. Finally, CHEER, a self-supervised and transfer-efficient cloud-to-edge framework, is developed to overcome data scarcity and hardware limitations. CHEER pre-trains lightweight GCN-based encoders using unlabeled cloud data, assigns new users through clustering, and trains only a compact on-device classifier with a few labeled samples, significantly reducing computation, memory, and energy usage while keeping raw data on-device—an essential property when dealing with sensitive physiological information.
-
YPP04.14
Co-Design of Lightweight Intrusion Detection Systems for IoT Devices.
18:30
Qingyu Zeng1,
Yuko Hara2
1 Institute of Science Tokyo
; 2 CNRS
The proliferation of Internet-of-Things (IoT) devices in smart homes and cyber-physical systems enlarges the attack surface and calls for intrusion detection that can run directly on resource-constrained edge devices. Existing intrusion detection systems (IDSs) are typically designed for servers and rely on heavyweight models and large memories, which makes them difficult to deploy in realistic IoT environments. This work targets the algorithm/architecture co-design of lightweight IDSs and IoT edge platforms. On the algorithm side, we design a compact feature representation and a lightweight detection model that can operate within tight memory and computation budgets. On the architecture side, we implement and optimize the runtime on embedded IoT platforms while preserving runtime performance and keeping the computing overhead minimal.
-
YPP04.15
Towards Autonomous Analog IC Design: Sample-Efficient, PVT-Robust, and Explainable Sizing via Agentic AI
18:30
Mohsen Ahmadzadeh1,
Georges Gielen1
1 KU Leuven
Analog and Mixed-Signal (AMS) circuit sizing remains a critical bottleneck in modern System-on-Chip design. As technology scales to advanced FinFET nodes, the manual effort required to satisfy strict performance constraints across Process, Voltage, and Temperature (PVT) variations becomes prohibitive. While Reinforcement Learning (RL) has emerged as a promising automation technique, current methods suffer from two fundamental flaws: extreme sample inefficiency and a "black-box" nature that prevents designer trust.
This research presents a holistic framework to resolve these challenges through two novel contributions. First, to address efficiency and robustness, we introduce AnaCraft, a "Duel-Play" adversarial RL algorithm. Unlike standard cooperative methods, AnaCraft trains sizing agents to compete against an adversarial PVT agent that actively searches for worst-case corners. Coupled with probabilistic Model-Based Policy Optimization (MBPO), this approach achieves fully PVT-robust designs with ~3x fewer simulations than state-of-the-art methods, as validated on a complex 7nm data receiver.
Second, to bridge the interpretability gap, we present AnaFlow, the first agentic Large Language Model (LLM) workflow for analog sizing. By orchestrating specialized agents, AnaFlow mimics the cognitive workflow of a human expert. It performs iterative refinement using text-based reasoning rather than opaque numerical gradients. Experimental results demonstrate that AnaFlow matches the optimality of RL baselines while providing transparent, human-readable justifications for design trade-offs. Together, these works pave the way for autonomous design agents that are both mathematically robust and intelligible to engineers.
-
YPP04.16
Multi-objective Quantitative Modeling for Design Space Exploration on AI processors
18:30
Jiacong Sun1
1 KU Leuven
The rapid evolution of AI algorithms, combined with the slow design cycles of hardware, creates a critical need for fast and accurate design space exploration tools for AI accelerators. This work presents an analytical simulation framework that enables multi-objective design space exploration across a wide range of hardware architectures and workloads. The framework is structured around three interrelated research directions. First, we develop objective cost models that quantify performance, carbon footprint, and sparsity-aware behavior, validated against silicon measurements with less than 7% mismatch. Second, we develop a dataflow mapper that finds optimal scheduling for both deterministic and uncertainty-aware sparsity scenarios. Third, we develop an algorithm abstraction frontend that generalizes across diverse workloads—from CNNs and Transformers to emerging models such as diffusion models and Ising machines. All components are released as open-source tools.
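As a much-simplified illustration of an analytical cost model, the sketch below estimates per-layer latency as a roofline-style maximum of compute and memory time with a linear energy model; all constants are illustrative placeholders, far coarser than the silicon-validated models in the framework.

    def layer_cost(macs, bytes_moved, n_pes=256, freq_hz=500e6,
                   dram_bw=25.6e9, e_mac=1e-12, e_byte=20e-12):
        t_compute = macs / (n_pes * freq_hz)       # seconds, fully utilized PE array
        t_memory = bytes_moved / dram_bw           # seconds, DRAM-bandwidth bound
        latency = max(t_compute, t_memory)         # roofline: slower of the two dominates
        energy = macs * e_mac + bytes_moved * e_byte
        return latency, energy

    # 3x3 convolution, 224x224x64 output, 64 input channels
    macs = 224 * 224 * 64 * 64 * 3 * 3
    lat, en = layer_cost(macs, bytes_moved=50e6)
    print(f"latency ~{lat*1e3:.2f} ms, energy ~{en*1e3:.2f} mJ")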
-
YPP04.17
Exploiting Neuromorphic Computing for Effective, Efficient, and Secure Edge-Cloud Collaborative Computing System
18:30
Haomin Li1,
Li Jiang1,
Fangxin Liu2
1 Shanghai Jiao Tong University
; 2 Shanghai Jiaotong University
This research proposes an edge-cloud collaborative framework leveraging neuromorphic computing to address challenges in computational efficiency, scalability, and security for edge-cloud systems. The framework integrates four key components: (1) an efficient neuromorphic learning method, (2) a federated learning framework tailored for neuromorphic models, (3) robust security mechanisms and architectures for edge and cloud, and (4) an efficient neuromorphic computing framework optimized for edge devices.
These components work synergistically: the federated learning framework orchestrates edge-cloud collaboration, the efficient neuromorphic model ensures high performance with low computational cost in the cloud, the computational framework optimizes edge inference, and advanced cryptographic and hardware-based protections secure the system.
-
YPP04.18
Design Tools for Adiabatic Superconducting Logic Circuits Toward Energy-Efficient Computing
18:30
Rongliang Fu1
1 The Chinese University of Hong Kong
Adiabatic Quantum-Flux-Parametron (AQFP) logic is a highly energy-efficient superconductor logic family that reduces dynamic energy dissipation through adiabatic switching driven by AC excitation currents. It overcomes the power limitations of traditional logic families like rapid-single-flux-quantum (RSFQ) and efficiently implements majority functions. However, its unique features, such as fan-out limitations and clock-synchronized data propagation, make circuit design more complex. Addressing these challenges requires advanced design automation. I have contributed to developing EDA tools for AQFP logic, focusing on logic synthesis, placement, and routing, to streamline circuit design, optimize layouts, and reduce energy consumption, moving AQFP logic closer to practical application.
-
YPP04.19
Agentic on Cybersecurity and LLM-Aided EDA: Benchmarks and Automated Systems
18:30
Minghao Shao1
1 Graduate Research Assistant
This thesis investigates how modern AI systems, particularly domain-specialized agentic systems and large language models (LLMs), can be designed, evaluated, and deployed across diverse domains. Although areas such as cybersecurity, electronic design automation (EDA), quantum computing, and machine learning security differ in their technical requirements, they share common challenges including the need for domain-aware datasets, reliable benchmarks, automated reasoning pipelines, and robust evaluation frameworks. Motivated by these shared challenges, this Ph.D. research advances AI system applications in four sections: (1) cybersecurity agentic systems, (2) AI-aided EDA, (3) LLM for quantum code generation, and (4) machine learning security covering jailbreak attacks and deepfake detection.
-
YPP04.20
Bridging the Gap Between Fine-Grained Architectures and Large Systolic Arrays
18:30
Andrea Belano1,
Francesco Conti1
1 University of Bologna
The wide variety of architectures found in modern neural networks and their rapid evolution highlight the need for a platform that is both performant with large matrices and capable of efficiently parallelizing smaller, distinct tasks.
For these reasons, we propose NAUSICAA (Neural Acceleration Unit for Scalable Integration and Configurable Adaptive Architecture), a multi-tile architecture that combines the flexibility of a fine-grained design with the high arithmetic intensity of large systolic arrays.
Through a hierarchical design coupling RISC-V cores with dedicated systolic arrays and a private memory, NAUSICAA enables near-zero-overhead pipelining of linear and non-linear layers. Furthermore, the local multicast network of each tile allows the hardware to dynamically adapt to the workload, delivering consistently high performance and efficiency across different kernels.
-
YPP04.21
Design Technology Co-Optimization of Emerging Storage Class Memories
18:30
Bowen Wang1,
Wim Dehaene2
1 imec & KU Leuven
; 2 KU Leuven
This thesis advances emerging memory technologies to overcome the energy and density limitations of conventional memories and to enable their efficient integration into modern computing systems. The research pursues two primary objectives: 1) developing compact modeling and design methodologies that bring non-volatile memories into standard very-large-scale integration (VLSI) circuit and computer architecture design flows; 2) demonstrating low-power, high-density memory solutions for bandwidth-intensive artificial intelligence (AI) accelerators. These objectives are addressed through two core research thrusts that structure the main technical chapters of this dissertation.
The first research enables a comprehensive array-level evaluation of power, performance, and area (PPA) within a design-technology co-optimization (DTCO) framework, for voltage-controlled magnetic anisotropy (VCMA) magnetoresistive random-access memory (MRAM).
The second research investigates a three-dimensional (3D) indium gallium zinc oxide (IGZO) charge-coupled device (CCD) block memory as an on-chip storage solution for AI accelerators, with its PPA and architectural benefits evaluated using both DTCO and system-technology co-optimization (STCO) methodologies.
Together, these contributions provide a unified design-enablement flow and a novel memory architecture, addressing key challenges in energy efficiency, density, bandwidth, and cross-layer technology-to-system integration for future high-performance and energy-efficient AI systems.
-
YPP04.22
Model and System Co-design for Distributed DNN Inference on Edge Heterogeneous Devices
18:30
Mingyu Hu1,
Amit Kumar Singh2,
Jonathon Hare1,
Geoff Merrett1
1 University of Southampton
; 2 University of Essex
The growing deployment of deep neural networks on edge and IoT platforms requires efficient and reliable inference in heterogeneous, resource-constrained, and time-varying edge environments. Distributed DNN inference within local edge networks can preserve full model functionality, but it also introduces challenges in workload partitioning, communication overhead, and robustness to device dynamics and failures. This work investigates model–system co-design to address these challenges. We first propose HyPerEdge, a framework that characterises nonlinear and diverse computational performance and employs automated hybrid partitioning that jointly optimises inter-layer and intra-layer partitioning, achieving substantial latency and energy reductions on real edge hardware. To enhance reliability and flexibility, we introduce Fluid Dynamic DNNs, which use nested incremental training to build modular sub-networks capable of performing standalone inference or collaborating with other sub-networks to reconstruct a larger model, thereby enabling adaptive distributed inference and maintaining robustness under device failures. Finally, we present adaptive ensembles of Dynamic DNNs that integrate multi-device scheduling with per-device width selection using a deadline-aware optimisation scheme, enabling fine-grained accuracy–latency trade-offs. Together, these methods provide a comprehensive solution for efficient, reliable, and adaptive distributed DNN inference on heterogeneous edge devices.
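A toy sketch of the inter-layer partitioning step for a two-device pipeline, choosing the cut that minimizes the slowest pipeline stage (compute on either device, or the activation transfer at the cut). All numbers and the two-device restriction are illustrative; HyPerEdge additionally explores intra-layer splits and larger device sets.

    def best_split(layer_flops, act_bytes, speed_a, speed_b, link_bps):
        """Return (stage_time, k): layers [0, k) on device A, the rest on device B."""
        n = len(layer_flops)
        best = (float("inf"), None)
        for k in range(n + 1):
            t_a = sum(layer_flops[:k]) / speed_a
            t_b = sum(layer_flops[k:]) / speed_b
            t_link = act_bytes[k - 1] / link_bps if 0 < k < n else 0.0
            stage = max(t_a, t_link, t_b)            # pipelined: throughput = 1 / stage
            best = min(best, (stage, k))
        return best

    flops = [4e8, 8e8, 8e8, 2e8]                     # per-layer workload (ops)
    acts = [6e5, 3e5, 1e5, 4e4]                      # activation bytes after each layer
    print(best_split(flops, acts, speed_a=8e9, speed_b=8e9, link_bps=1e9))  # -> (0.15, 2)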
-
YPP04.23
Ultra-low Latency and Extreme-throughput Neural Network Accelerators on FPGA
18:30
Atousa Jafari1,
Marco Platzner2
1 researcher
; 2 professor
The goal of this thesis is to study the feasibility, benefits, and limitations of mapping neural network models to FPGA platforms in the direct logic implementation style to achieve ultra-low latency, extreme throughput, and high energy efficiency. To reach this goal, novel and well-established techniques for pruning and quantization at the algorithmic level, as well as hardware-level approximations, are investigated. Through these methods, computational cost is reduced, and redundant connections are eliminated, addressing the scalability issue. In terms of network models, this PhD project focuses on both feed-forward (FFNN) and recurrent neural networks (RNN). In particular, the Echo State Network (ESN), a widely used form of Reservoir Computing (RC), is explored as a promising alternative to conventional RNNs, offering a simpler, less computationally intensive, and more hardware-efficient solution for time-series analysis. Experiments for recurrent networks are conducted on two major categories of tasks: (i) time-series classification (TSC) and (ii) time-series forecasting (regression).
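A minimal NumPy sketch of the Echo State Network state update assumed here: a fixed random reservoir whose states feed a trainable linear readout, which is what keeps ESNs cheap to map to hardware; sizes and scaling constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res = 3, 100
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))        # fixed input weights
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))           # fixed recurrent weights
    W *= 0.9 / max(abs(np.linalg.eigvals(W)))             # keep spectral radius below 1

    def run_reservoir(inputs):                            # inputs: (T, n_in)
        x = np.zeros(n_res)
        states = []
        for u in inputs:
            x = np.tanh(W_in @ u + W @ x)                  # reservoir state update
            states.append(x)
        return np.stack(states)                            # (T, n_res)

    states = run_reservoir(rng.standard_normal((50, n_in)))
    # Only a linear readout (e.g., ridge regression on `states`) is trained;
    # W_in and W stay fixed, so no backpropagation through time is needed.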
-
YPP04.24
Enabling Large-Scale RTL Simulation and Realistic System Integration
18:30
Guillem López-Paradís1,
Jonathan Balkind2,
Adrià Armejach3,
Miquel Moreto4
1 Barcelona Supercomputing Center
; 2 UC Santa Barbara
; 3 BSC & UPC
; 4 BSC
During the last decade, heterogeneous system-on-chip (SoC) architectures have become universally popular, lately being further enriched with custom accelerators. However, available tools to simulate low-level RTL designs often overlook the specific target system in which the design will eventually operate. This hinders proper testing and debugging of functionalities, and does not allow co-designing the accelerator to obtain a balanced and efficient architecture.
At the same time, despite the widespread adoption of highly parallel multicore processors, which currently offer up to hundreds of cores per chip, RTL simulation remains largely sequential. While most of the software domains take advantage of thread- and node-level parallelism, RTL simulators are usually restricted to a single node and only a handful of threads. This lack of parallelisation creates performance bottlenecks, especially as modern hardware designs grow in scale and complexity, ultimately slowing down verification and architectural exploration.
In this work, we present two open-source tools developed to address these challenges. First, Metro-MPI is a distributed parallel RTL simulation framework that leverages MPI to significantly accelerate simulation throughput across many cores and nodes. Second, gem5+RTL is a framework that integrates RTL modules directly inside the gem5 full-system simulation environment, enabling cycle-accurate accelerator modelling within realistic, OS-driven workloads. We conclude with remarks on future directions and on how these tools have been deployed to, e.g., accelerate design-space exploration studies.
-
YPP04.25
Architectural Defenses against Hardware Attacks on RISC-V cores
18:30
Songqiao Cui1,
Ingrid Verbauwhede2,
Josep Balasch3
1 KU Leuven
; 2 KU Leuven - COSIC
; 3 Rambus
Hardware attacks are powerful techniques to extract secret data from electronic devices. By monitoring measurable physical signals at run-time or by actively tampering with a device, an adversary can uncover sensitive application data. IoT and embedded devices are particularly susceptible to these attacks, as edge nodes are commonly assembled with general-purpose microcontrollers that lack any dedicated protection. While countermeasures exist, implementing them in software is far from optimal. A more promising approach is to integrate them at the hardware level. This PhD aims to design, implement and evaluate hardware architectural extensions that provide built-in resistance against hardware attacks. The research activities are centered around RISC-V, an open instruction set architecture that can be freely extended with application-specific modules. This PhD investigates how to prevent data leakages by using a combination of protected execution units and instruction randomization. Additionally, the architectural solutions are extended to detect intentional data tampering. The performance and security guarantees of the resulting architectures are evaluated and demonstrated using FPGA platforms.
-
YPP04.26
Deeply Understanding and Efficiently Mitigating DRAM Read Disturbance via New Testing Infrastructure and Comprehensive Real-Chip Experimental Studies
18:30
Ataberk Olgun1,
Onur Mutlu1
1 ETH Zurich
Read disturbance (e.g., RowHammer and RowPress) in modern DRAM chips is a widespread phenomenon and is reliably used for breaking memory isolation, a fundamental building block for building robust systems. DRAM chips are increasingly vulnerable to read disturbance phenomena due to DRAM technology scaling. Even though many prior works develop various read disturbance solutions, these solutions incur non-negligible and increasingly higher system performance, energy, and hardware area overheads as read disturbance worsens. This work advances the state of the art by enabling insightful experimental studies of modern DRAM chips via an easy-to-use FPGA-based infrastructure, deepening our understanding of read disturbance in cutting-edge High Bandwidth Memory DRAM chips via rigorous real-chip characterization, uncovering a new read disturbance phenomenon that shows how challenging it is to reliably determine the read disturbance susceptibility of a DRAM chip, and designing a new low-cost and low-overhead read disturbance solution.
-
YPP04.27
Application-Specific Hardware Optimization using Virtual Prototypes
18:30
Jan Zielasko1,
Rolf Drechsler2
1 Cyber-Physical Systems, DFKI GmbH
; 2 University of Bremen/DFKI
Identifying the optimal hardware configuration for running complex workloads such as Neural Network inference on ultra-low-power edge devices is critical for reducing cost and maximizing performance. Tailoring hardware designs to specific applications significantly increases resource utilization, which is essential to meet the strict performance and energy constraints. Unfortunately, exploring the design space at the hardware level is challenging due to the complexity and time-consuming nature of hardware design processes. In this PhD work, we present an analysis platform based on a RISC-V Virtual Prototype (VP) to systematically identify fine-grained hardware optimization opportunities.
The VP models the entire hardware platform while remaining fast and accessible. Combined with a custom execution-trace compression and analysis framework, it enables the capture and processing of billions of executed instructions.
Applied to a wide range of representative embedded and edge artificial intelligence workloads from the Embench and MLPerf Tiny benchmark suites, our approach successfully identifies promising optimization opportunities beyond the matrix multiplication kernel that are non-trivial to detect from either source-code or gate-level analysis.
-
YPP04.28
A New Era of EDA: A Gen-AI-Aided Framework for Hardware Development
18:30
Kaiyuan Yang1,
John Goodenough1,
Tiantai Deng1
1 The University of Sheffield
AI workloads need specialised accelerators with short time to market across cloud, edge, and terminal devices. However, conventional EDA is slow across abstraction levels, adds physical awareness late, and relies on manual system partitioning. My dissertation addresses these gaps with three elements: Natural-Level Synthesis (NLS), a new design stage from high-level intent to synthesis-ready RTL; AI-SoC, a framework for constraint-aware system partitioning; and XChip, a generative-AI-aided toolchain that carries natural-language system descriptions through these stages to RTL and GDSII with simple internal checks. I define an evaluation benchmark that combines Quality of Generated Hardware (QGH) from synthesis PPA with Required Design Effort (RDE) from prompt, code, and adjustment measures. Across a range of accelerator designs, the framework reduces design effort while maintaining or improving PPA and yields early partitions that better match real constraints.
-
YPP04.29
Side-Channel Awareness in Neural Network FPGA Accelerators: Security Threats and Opportunities for Functional Safety
18:30
Vincent Meyers1,
Mehdi Tahoori1
1 Karlsruhe Institute of Technology
Neural network accelerators on FPGAs are increasingly deployed in multi-tenant and resource-constrained environments, where data-dependent switching activity exposes them to physical side-channel leakage. This thesis analyzes how such leakage affects confidentiality and reliability in realistic FINN-generated accelerators and develops methods that exploit or mitigate these effects.
We show that accelerator implementation details such as folding significantly shape leakage characteristics. A naive profiled attack recovers layer sizes with only 44% accuracy, while our folding-aware method achieves 100%. We further introduce a generative attack that reconstructs input images from single power traces across devices and conditions and a Trojan-based training approach that boosts output-recovery accuracy by up to 33%.
On the defensive side, we propose side-channel-aware training using differentiable power models to suppress leakage and demonstrate reliable remote fingerprinting using an abstract power model. We also show that on-chip voltage sensors enable concurrent out-of-distribution and hardware-fault detection with negligible overhead.
Overall, the results highlight the dual role of side-channel information as both a security risk and a lightweight source of runtime safety signals.
-
YPP04.30
Towards Lightweight Authentication: PUF‑enabled Mutual Trust and Key-exchange for IoT
18:30
Chandranshu Gupta1,
Gaurav Varshney1
1 Indian Institute of Technology Jammu
The widespread adoption of the Internet of Things (IoT) has brought billions of constrained devices into security-critical environments, yet most cannot support the computation, storage or energy required for conventional cryptographic infrastructures. As a result, many deployments lack robust authentication and secure key establishment, creating significant vulnerabilities. Physical Unclonable Functions (PUFs) offer a promising foundation for addressing these limitations through hardware-embedded uniqueness that avoids storing long-term secrets. This dissertation develops a unified PUF-based framework to provide scalable and lightweight trust for IoT systems. It presents an offline authentication protocol for BLE and Zigbee devices using an Arbiter PUF, a certificate-less PUF-assisted public-key mechanism that eliminates stored certificates, and an SRAM-PUF logic-locking scheme that generates stable device keys for authentication and IP protection. Together, these contributions demonstrate a practical pathway for secure IoT operation without relying on heavyweight cryptography.
-
YPP04.31
Generative AI in the Hardware Design Flow: From High-Level Synthesis to Security
18:30
Luca Collini1,
Ramesh Karri2
1 NYU Tandon School of Engineering
; 2 NYU
Modern silicon technologies, such as sub-10 nm process nodes, 3D-stacked wafers, chiplet-based architectures, and in-package memory, have enabled increasingly energy-efficient and high-performance chips. These advances support new software applications and products, including large-scale AI workloads, real-time edge processing, and immersive augmented or virtual reality, which in turn drive greater demand for specialized hardware. At the same time, these technologies significantly increase design complexity and cost. Developing a chip from specification to prototype can require over $100 million and at least twelve months. Furthermore, as more computation moves to the edge and devices handle growing volumes of user data, security must be considered starting from the hardware level. Security validation and verification often fall on the critical path of the design process, potentially increasing time to market.
-
YPP04.32
Enhancing AI Systems Safety and Reliability
18:30
Vittorio Turco1,
Matteo Sonza Reorda1,
Annachiara Ruospo1
1 Politecnico di Torino
The widespread adoption of Deep Neural Networks (DNNs) in safety-critical domains demands rigorous reliability guarantees against random hardware faults. This research proposes robust approaches for on-line fault detection, providing a software-level and cost-effective alternative to standard reliability methodologies. The work first introduces early detection techniques based on tensor-related metrics applied to intermediate Output Feature Maps (OFMs), enabling the identification of faults before they corrupt the final output. This methodology is subsequently refined into a two-phase monitoring strategy for classification tasks, combining embedding analysis with on-demand Image Test Libraries (ITLs) to identify hard-to-detect faults. Expanding beyond classification, the research addresses complex semantic segmentation tasks by introducing the APSS (Area, Position, Symmetry, Shape) metrics, which perform real-time, "golden-free" output validation. Finally, to enable fair and reproducible assessments of these detection capabilities, a standardized Benchmark Suite has been established, providing a common ground for future reliability studies across different frameworks.
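As a simplified illustration of monitoring intermediate Output Feature Maps, the sketch below profiles per-layer value ranges on fault-free runs and flags inferences whose OFMs leave the profiled envelope; the actual methodology relies on richer tensor metrics, embedding analysis, and ITLs, so this only conveys the basic idea.

    import numpy as np

    class OfmMonitor:
        def __init__(self, margin=0.1):
            self.bounds, self.margin = {}, margin

        def profile(self, layer, ofm):               # call on fault-free (golden) runs
            lo, hi = float(ofm.min()), float(ofm.max())
            old_lo, old_hi = self.bounds.get(layer, (lo, hi))
            self.bounds[layer] = (min(lo, old_lo), max(hi, old_hi))

        def check(self, layer, ofm):                 # call on-line during inference
            lo, hi = self.bounds[layer]
            slack = (hi - lo) * self.margin          # tolerate small excursions
            return bool(lo - slack <= ofm.min() and ofm.max() <= hi + slack)

    mon = OfmMonitor()
    for _ in range(100):                             # profiling with golden OFMs
        mon.profile("conv1", np.random.randn(64, 32, 32))
    print(mon.check("conv1", np.random.randn(64, 32, 32)))        # True: plausible OFM
    print(mon.check("conv1", 1e6 * np.random.randn(64, 32, 32)))  # False: likely faulty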
-
YPP04.33
Leveraging Machine Learning in Physical Design Implementation and Verification
18:30
Pooja Beniwal1,
Sneh Saurabh2
1 Indraprastha Institute of Information Technology Delhi (IIIT-Delhi)
; 2 Indraprastha Institute of Information Technology
Electronic design automation (EDA) comprises the software tools used to design modern integrated circuits, covering synthesis, placement, routing, timing analysis, and verification. As designs scale to billions of transistors, challenges arise from increased complexity, tighter timing margins, higher power-performance-area (PPA) demands, and long runtimes. Machine learning (ML) addresses these issues by learning complex patterns from large design datasets, enabling faster inference and improved optimization. Motivated by these advantages, this research integrates ML into static timing analysis (STA) and VLSI physical design to improve accuracy, reduce pessimism and design effort, and enable more scalable design flows.
-
YPP04.34
On Cutwidth: Polynomial-Time Formal Verification, Debugging, and Correction for Circuits
18:30
Mohamed Nadeem1,
Rolf Drechsler2
1 University of Bremen
; 2 University of Bremen/DFKI
Ensuring the correct functionality of digital circuits is a critical focus for both academic and industrial research. Formal Verification (FV) is a well-established technique for achieving complete circuit correctness. However, as circuit complexity grows, verification methods face significant challenges in establishing resource bounds, making verification impractical for many designs. To address this, Polynomial Formal Verification (PFV) has been introduced to provide bounded time and space for verification.
In our research, we proposed Linear-Time Formal Verification (LFV) as a subclass of PFV for circuits with constant Cut Width (CW) and proved that verification of combinational and sequential designs can be performed in linear time for both exact and approximate configurations in binary and Multi-Valued Logic (MVL) domains, improving on previously known higher-degree polynomial bounds. We further defined Polynomial Debugging and Fault Correction (PDFC), a subclass of Debugging and Fault Correction (DFC) for circuits with constant CW, and showed that it could be solved in polynomial time and space, thereby enabling efficient debugging and related Electronic Design Automation (EDA) tasks that had not been tackled before.
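For reference, the sketch below computes the cutwidth of a circuit graph under a given linear order of its gates, i.e., the largest number of wires crossing any gap between consecutive positions; the ripple-carry-style example keeps this value constant as the chain grows, which is the structural property the linear-time results rely on. The example is illustrative and not taken from the thesis.

    def cutwidth(order, edges):
        pos = {g: i for i, g in enumerate(order)}
        width = 0
        for gap in range(len(order) - 1):            # gap between positions gap and gap+1
            crossing = sum(1 for (u, v) in edges
                           if min(pos[u], pos[v]) <= gap < max(pos[u], pos[v]))
            width = max(width, crossing)
        return width

    # ripple-carry-style chain: the cut stays narrow in this natural order
    edges = [("a0", "s0"), ("b0", "s0"), ("s0", "s1"), ("a1", "s1"), ("b1", "s1")]
    print(cutwidth(["a0", "b0", "s0", "a1", "b1", "s1"], edges))   # -> 3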
-
YPP04.35
Design techniques for AI-enabled embedded systems
18:30
Giovanni Pollo1,
Enrico Macii1,
Daniele Jahier Pagliari1,
Sara Vinco1,
Alessio Burrello2
1 Politecnico di Torino
; 2 Politecnico di Torino and Università di Bologna
Designing AI-enabled embedded systems is challenging due to tight resource budgets and complex interactions between hardware and software. On the one hand, on-device AI models must meet strict memory, latency, and energy constraints while delivering accurate real-time decisions, which requires dedicated model optimization techniques. On the other hand, accurate system-level modeling is essential to capture closed-loop interactions between algorithms, hardware, and the surrounding environment, and to assess how model-level choices impact end-to-end system behavior. The purpose of this PhD thesis is to support the design of embedded AI systems in complex domains such as biosignal processing and robotics. The first main contribution is an automated Deep Neural Network (DNN) optimization flow, tailored for embedded deployment, which combines Neural Architecture Search (NAS), structured pruning, and quantization to explore accuracy-efficiency trade-offs under tight constraints. The second contribution is a Virtual Prototyping (VP) framework that offers a detailed virtual representation of the target platform, enabling Design-Space Exploration (DSE) of hardware and software options.
-
YPP04.36
AI-Powered Resilient Perception for Autonomous Systems: Lightweight Near-Sensor Point Cloud Corruption Detection
18:30
Grafika Jati1,
Martin Molan1,
Francesco Barchi2,
Andrea Acquaviva1
1 University of Bologna
; 2 Università di Bologna
LiDAR is a critical sensor for autonomous vehicles, helping them detect and understand their surroundings. But what happens when the sensor itself is compromised? In real-world driving, LiDARs can be affected by everyday contaminants like water, mud, dust, or engine oil. These substances can distort the data, causing the vehicle to miss objects—or worse, see things that aren't there—with dangerous levels of confidence. Most current AI models are trained in ideal, clean conditions and are not prepared for these real-world challenges.
Our research tackles this problem by creating the first dataset of real-world corrupted point cloud data affected by physical contamination. We also develop an intelligent safety layer that detects when LiDAR data is unreliable—before it affects critical decisions like braking or turning. Designed to run on lightweight edge devices, our system is fast, efficient, and adaptable. In future work, we aim to make the system smarter through few-shot learning and uncertainty-based decision support, helping autonomous vehicles become safer and more trustworthy in the messy, unpredictable real world. Data and code will be available at: https://gitlab.com/ecs-lab/lidaroc, https://gitlab.com/ecs-lab/anzil, and https://gitlab.com/ecs-lab/distilling-pointcloud-corruption.
-
YPP04.37
Efficient Spiking Neural Networks for Edge-Based Auditory Perception
18:30
Shreya Kshirasagar1,
Christian Mayr2
1 Robert Bosch GmbH (Bosch Research)
; 2 TU Dresden
The increasing demand for real-time auditory perception in resource-constrained environments, such as automotive edge devices, necessitates the development of efficient and robust neural network architectures. This PhD research focuses on leveraging Spiking Neural Networks (SNNs) to address the challenges of auditory perception, specifically targeting the detection of siren sounds in noisy environments. The proposed SpikeSireNet architecture demonstrates comparable accuracy while significantly reducing model size and computational requirements compared to conventional Recurrent Neural Networks (RNNs).