-
YPP04.1
Design of Energy-Efficient, High-Throughput Reconfigurable Systems through Cross-Layer Approximation and In-Network Computing
18:30
Zahra Ebrahimi1,
Akash Kumar2
1 Technische Universität Dresden
; 2 Ruhr University Bochum
As technology scaling plateaus, sustaining high performance and energy-efficiency requires alternative design strategies. Modern signal-processing and machine-learning workloads across the edge-to-cloud continuum exhibit a natural tolerance to noise and computational imprecision, enabling the relaxation of strict accuracy constraints. This trend has fueled the rise of cross-layer approximation, where optimizations are applied across circuit, architecture, and application layers to extract additional performance. At the same time, the increasing scale and data intensity of emerging applications expose the limitations of traditional edge-cloud computing, particularly in meeting real-time, throughput, and energy-efficiency requirements. These challenges have motivated the shift toward in-network computing (INC), which exploits programmable network devices to process data in transit.
Despite this need, existing approximation and INC techniques fall short: prior works often overlook INC constraints, target ASICs rather than reconfigurable fabrics, focus primarily on adders and multipliers while neglecting costly division, operate in non-pipelined SISD styles that limit throughput, omit architectural-level optimizations, and lack general frameworks for tuning approximation knobs in multi-kernel applications.
In response, this thesis introduces a family of cross-layer approximation architectures tailored for reconfigurable FPGA- and CGRA-based fabrics. New approximate multipliers and dividers featuring fine-grained pipelining and SIMD execution models are proposed, along with a hybrid SIMD multiplier-divider that supports runtime functional and precision versatility without reconfiguration. These designs are incorporated into heterogeneous SIMD/MIMD CGRAs, establishing a unified cross-layer hierarchy.
Building on this foundation, a cross-layer methodology is developed that applies kernel-wise sensitivity analysis to capture performance-QoR trade-offs, followed by a greedy heuristic that tunes approximation knobs to maximize performance under user-defined QoR constraints.
Finally, the approach is extended to distributed and INC environments, enabling in-network acceleration of multi-kernel applications. By restructuring computations for programmable switches, the method reduces resource usage and communication overhead while improving real-time responsiveness, aligning with key 5G/6G performance targets.
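As a rough illustration of the knob-tuning step, the sketch below implements a generic greedy loop that raises one per-kernel approximation level at a time as long as the quality-of-result (QoR) budget is still met; the callback names and knob encoding are illustrative placeholders, not the thesis interface.

    # Greedy tuning of per-kernel approximation knobs under a QoR constraint.
    # evaluate_qor(cfg) and evaluate_perf(cfg) are assumed application-provided
    # callbacks; higher levels mean more aggressive approximation.

    def greedy_tune(kernels, max_level, qor_limit, evaluate_qor, evaluate_perf):
        cfg = {k: 0 for k in kernels}                  # start from the exact configuration
        improved = True
        while improved:
            improved = False
            best_gain, best_kernel = 0.0, None
            base_perf = evaluate_perf(cfg)
            for k in kernels:
                if cfg[k] == max_level:
                    continue
                trial = dict(cfg, **{k: cfg[k] + 1})   # raise one knob by one step
                if evaluate_qor(trial) < qor_limit:    # reject if the QoR budget is violated
                    continue
                gain = evaluate_perf(trial) - base_perf
                if gain > best_gain:
                    best_gain, best_kernel = gain, k
            if best_kernel is not None:
                cfg[best_kernel] += 1                  # commit the most profitable knob
                improved = True
        return cfg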
-
YPP04.2
Efficient Compilation for Coarse-Grained Reconfigurable Arrays
18:30
Cristian Tirelli1,
Laura Pozzi1
1 USI Lugano
Coarse Grain Reconfigurable Arrays (CGRAs) are programmable accelerators that offer a compelling balance between flexibility and energy efficiency, making them attractive for compute intensive workloads under tight power and resource constraints. Their effectiveness, however, depends critically on the quality of the mapping, meaning how operations are placed on processing elements and scheduled in time. Existing compilation approaches largely rely on heuristics, which only explore a small portion of the mapping space and can miss high quality solutions, especially as architectures and workloads grow in size.
My thesis work advances CGRA compilation through two formal mapping strategies. The first contribution is a satisfiability based formulation of the mapping problem, which uses a solver to explore the complete mapping space and produce high quality placements, schedules, and routes. The second contribution addresses scalability by separating the temporal and spatial components of the problem, first computing a schedule in time and then constructing a spatial mapping through a graph monomorphism search. Experiments across a range of CGRA sizes show that, together, these methods preserve mapping quality while reducing compilation time by orders of magnitude, making exact and formally grounded CGRA compilation practical for larger architectures and more complex workloads.
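A minimal sketch of the spatial step, assuming the schedule has already been fixed: a backtracking search that assigns each dataflow operation to a distinct processing element such that every data dependency lands on a physical link, i.e., a graph monomorphism. The names and the toy 2x2 mesh are illustrative; the actual tool also models time and routing.

    # Backtracking graph-monomorphism search: every DFG node gets a distinct PE,
    # and every DFG edge must be realizable over a physical CGRA link.

    def find_mapping(dfg_nodes, dfg_edges, pes, cgra_links):
        def ok(partial, node, pe):
            for (u, v) in dfg_edges:
                if u == node and v in partial and (pe, partial[v]) not in cgra_links:
                    return False
                if v == node and u in partial and (partial[u], pe) not in cgra_links:
                    return False
            return True

        def search(partial, remaining):
            if not remaining:
                return partial
            node = remaining[0]
            for pe in pes:
                if pe not in partial.values() and ok(partial, node, pe):
                    result = search({**partial, node: pe}, remaining[1:])
                    if result is not None:
                        return result
            return None                                # backtrack

        return search({}, list(dfg_nodes))

    # Example: map a 3-operation chain a->b->c onto a 2x2 mesh (directed links).
    links = {(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)}
    print(find_mapping(["a", "b", "c"], [("a", "b"), ("b", "c")], [0, 1, 2, 3], links))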
-
YPP04.3
A Study of Performance Optimization Techniques of Digital Integrated Circuit Test Generation System
18:30
Zhiteng Chao1
1 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; CASTEST Co., Ltd.
Electronic Design Automation (EDA) for digital integrated circuits covers major design steps such as logic synthesis, test synthesis, and physical design. A test synthesis EDA tool must support Design For Testability (DFT) and Automatic Test Pattern Generation (ATPG) to ensure comprehensive detection of circuit defects after manufacturing. The ATPG problem is NP-complete, and the growing scale of digital integrated circuit designs presents significant challenges to ATPG systems: the testability of complex circuits decreases, making fault coverage difficult to guarantee; the number of test patterns grows dramatically, leading to high test costs; and test generation time can range from days to weeks, becoming a bottleneck in the iterative design process. Improving the efficiency of test generation while ensuring fault coverage and compressing test data has therefore become a critical research direction for test synthesis EDA technology.
-
YPP04.4
Towards Performance-driven Analog Layout Design
18:30
Peng Xu1,
Bei Yu1
1 The Chinese University of Hong Kong
Although automated analog layout synthesis tools exist, they often struggle to accurately model performance, especially in the presence of layout-induced parasitics and device mismatch.
To overcome this limitation, this work proposes a unified, performance-driven analog place-and-route (PNR) methodology that tightly integrates optimization with machine learning.
The framework first introduces an unsupervised, physics-guided circuit representation learning scheme.
It then introduces a multi-objective analog placement engine with a differentiable performance predictor and a performance-driven routing framework that learns effective routing guidance.
-
YPP04.5
From Theory to Silicon: Mapping Nested Loops to Tightly Coupled Processor Arrays (TCPAs)
18:30
Dominik Walter1,
Jürgen Teich1
1 FAU
The increasing demand for computational performance has driven the development of specialized accelerators.
Massively parallel arrays of tiny processing elements that process and communicate data locally are a natural match for a large class of computationally intensive problems characterized by nested loop programs with affine data dependencies, such as those common in linear algebra, signal processing, and image processing.
This work investigates such a class of processor arrays called Tightly Coupled Processor Arrays (TCPAs).
For their programming, TCPAs use a formal model of loop nests called Piecewise Regular Algorithms (PRAs) that decouples the semantics of algorithms from their low-level implementation.
This allows for an iteration-centric mapping paradigm that exploits parallelism at the level of loop iterations.
We show that this approach achieves up to 19x speedup over an operation-centric coarse-grained reconfigurable array.
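As a toy illustration of the iteration-centric idea (not the PRA notation itself), the following sketch applies a simple space-time mapping to a two-dimensional loop nest, assigning each iteration (i, j) to processor p = i and time step t = j; sizes are illustrative.

    import numpy as np

    A = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows -> 4 PEs
    x = np.array([1.0, 2.0, 3.0])
    acc = np.zeros(4)                              # one local accumulator per PE

    for t in range(A.shape[1]):                    # schedule: global time step t = j
        for p in range(A.shape[0]):                # allocation: row i -> PE p (parallel in hardware)
            acc[p] += A[p, t] * x[t]

    assert np.allclose(acc, A @ x)                 # y = A x computed iteration by iteration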
In addition, we propose a zero-overhead loop control mechanism and a loop-aware memory subsystem that schedules all required data transfers between external and on-chip memories in real time.
These concepts are validated through FPGA and ASIC prototypes, including a 22 nm ALPACA chip delivering 537.6 GFLOPs at 700 MHz.
Overall, these results demonstrate that TCPAs offer an efficient, scalable, and formally sound architecture for accelerating nested loops.
-
YPP04.6
Towards Ultra-reliable Automotive Systems-on-Chip
18:30
Giusy Iaria1,
Paolo Bernardi1
1 Politecnico di Torino
Modern automotive devices must deliver increasingly advanced functionality by incorporating millions of logic gates and scan flip-flops. The growing complexity of these devices is putting a strain on traditional manufacturing testing methodologies, which, although they still succeed in ensuring the required high reliability standards, need new techniques to better deal with this complexity. This work details several proposed methodologies that aim to optimize the testing process across all life stages of an industrial automotive device, starting from Wafer Sort and ending in the In-Field phase. The contributions aim at guaranteeing high reliability despite the increasing complexity of modern devices.
Experiments have been carried out using automotive devices and manufacturing data provided by STMicroelectronics, demonstrating practical applicability in industrial environments.
-
YPP04.7
Hardware Acceleration with Backdoor and Side-Channel Analysis for Lattice-based Cryptography
18:30
Suraj Mandal1,
Debapriya Basu Roy2
1 IIT Kanpur
; 2 Indian Institute of Technology Kanpur
Due to the recent growth in the development of quantum computers, post-quantum cryptographic (PQC) algorithms have become the centre of attention, replacing classical cryptographic algorithms like ECC/RSA, which are vulnerable to quantum attacks. Among the PQC algorithms, CRYSTALS-Kyber (FIPS-203) and CRYSTALS-Dilithium (FIPS-204) are the two standardized algorithms recommended by NIST as a quantum-secure key encapsulation mechanism (KEM) and a quantum-secure digital signature algorithm (DSA), respectively. Both CRYSTALS-Kyber and CRYSTALS-Dilithium are module-lattice (ML) based algorithms that require large polynomial multiplications as well as several hash operations, which make software implementations of these algorithms very slow. In this work, we have designed low-area and low-latency dedicated hardware accelerators for both algorithms. Our proposed hardware accelerators have been tested and implemented on FPGAs. Furthermore, our proposed polynomial multipliers based on the Number Theoretic Transform (NTT) can also be used for homomorphic encryption schemes and other lattice-based algorithms. We have also performed security analysis, including side-channel analysis and backdoor attacks, on both of these algorithms.
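To illustrate the arithmetic these accelerators target, the sketch below multiplies two polynomials in Z_q[X]/(X^n + 1) by schoolbook negacyclic convolution and cross-checks the result against an evaluation-form Number Theoretic Transform. The toy parameters (q = 257, n = 8) are chosen only so that a 2n-th root of unity exists; they are not the Kyber or Dilithium parameters, and the real accelerators use an O(n log n) butterfly NTT.

    q, n = 257, 8

    def negacyclic_schoolbook(a, b):
        c = [0] * n
        for i in range(n):
            for j in range(n):
                s = a[i] * b[j]
                if i + j < n:
                    c[i + j] = (c[i + j] + s) % q
                else:
                    c[i + j - n] = (c[i + j - n] - s) % q   # wrap with X^n = -1
        return c

    # psi: a primitive 2n-th root of unity mod q (psi^n = -1)
    psi = next(w for w in range(2, q)
               if pow(w, n, q) == q - 1 and pow(w, 2 * n, q) == 1)

    def ntt(a):      # evaluation form: a_hat[i] = a(psi^(2i+1))
        return [sum(a[j] * pow(psi, j * (2 * i + 1), q) for j in range(n)) % q
                for i in range(n)]

    def intt(a_hat):  # interpolation back to coefficients (Python 3.8+ modular inverse)
        n_inv = pow(n, -1, q)
        return [(n_inv * sum(a_hat[i] * pow(psi, -j * (2 * i + 1), q)
                             for i in range(n))) % q
                for j in range(n)]

    a = [3, 1, 4, 1, 5, 9, 2, 6]
    b = [2, 7, 1, 8, 2, 8, 1, 8]
    pointwise = [(x * y) % q for x, y in zip(ntt(a), ntt(b))]
    assert intt(pointwise) == negacyclic_schoolbook(a, b)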
-
YPP04.8
Multi-objective Full-Process Design Space Exploration for Chiplet Heterogeneous Integration
18:30
Shixin Chen1
1 The Chinese University of Hong Kong
Chiplet heterogeneous integration has emerged as a pivotal solution to circumvent Moore's Law limitations, enabling flexible scaling, heterogeneous integration, and high component reusability for high-performance computing systems.
However, its design process is plagued by four intertwined challenges that hinder optimal tape-out.
First, the design space expands exponentially with parameters spanning die partitioning, interconnection topology, microarchitecture configuration, and packaging options, leading to inefficient blind exploration.
Second, conflicting multi-objective requirements (performance, power, area, cost, and reliability) demand careful trade-offs, as metrics like power consumption correlate with thermal distribution and area directly impacts manufacturing cost.
Third, the inherent trade-off between simulation speed and accuracy persists, where high-precision simulations are computationally prohibitive, while fast simulations lack sufficient fidelity or only cover partial metrics.
Fourth, cross-layer mismatches between architectural design and physical implementation (e.g., advanced packaging-induced latency, floorplan-induced communication overhead) often render theoretical performance gains unachievable in practice.
Addressing these challenges requires not only isolated optimizations but a holistic framework that integrates application guidance, efficient simulation, multi-objective optimization, and physical implementation—ensuring both exploration efficiency and final design quality.
-
YPP04.9
Near-Data Processing of Structured and Semi-structured NoSQL Data on FPGA-based SoCs
18:30
Tobias Hahn1,
Jürgen Teich1
1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
The exponential growth of diverse data formats has created new challenges for efficient data processing systems. Contemporary NoSQL data formats such as JSON, Apache Avro, and Apache Parquet provide substantial benefits including rapid schema evolution, enhanced query performance, reduced storage footprint, and improved version compatibility, yet remain poorly optimized for traditional processing systems designed primarily for structured relational data. This work summarizes approaches to near-data processing of structured and semi-structured NoSQL documents on FPGA-based Systems-on-Chip (SoCs). A suite of specialized hardware accelerators is presented that efficiently handle the parsing, preprocessing, and transformation of complex NoSQL data formats, demonstrating significant performance improvements over conventional CPU-based solutions. The proposed accelerators are designed for minimal resource utilization while achieving high throughput, making them well-suited for integration into heterogeneous computing architectures such as SmartSSDs and SmartNICs. This enables efficient near-data processing setups that significantly reduce data movement overhead and improve overall system performance for NoSQL workloads.
-
YPP04.10
HW-SW co-design techniques for DT-based Ensemble models on embedded systems
18:30
Alessandro Verosimile1,
Marco D. Santambrogio1
1 Politecnico di Milano
The growing adoption of Artificial Intelligence of Things (AIoT) systems demands embedded architectures capable of executing complex Machine Learning (ML) workloads under strict constraints on latency, throughput, and resource utilization. Decision Tree (DT) Ensembles, such as Random Forests (RFs), remain state-of-the-art for heterogeneous tabular data and offer inherent parallelism suitable for hardware acceleration. However, their efficient deployment on resource-constrained devices requires dedicated co-optimization of both the model and its hardware architecture. This PhD project presents HW–SW co-design methodologies that unify training and architectural development to produce models that are simultaneously accurate and hardware-compliant. Building upon prior memory-centric designs, the proposed flow introduces a novel multi-pipeline design exploiting both vertical and horizontal parallelism. Thanks to optimized resource utilization strategies, heterogeneous-depth training, and automated spatial-architecture acceleration mechanisms, the resulting architecture significantly increases the number of DTs that can be supported on embedded platforms while significantly reducing inference latency, achieving up to a 46.8× improvement over state-of-the-art solutions with negligible accuracy loss. Lastly, the framework has been extended to perform the inference of Oblique Random Forests (ORFs). Thanks to a hardware-aware training procedure and to a novel circulant memory mapping scheme, this approach significantly improves accuracy and latency compared to traditional axis-aligned Random-Forest accelerators.
-
YPP04.11
AI-Driven Topology Synthesis and System-level Optimization of Analog/Mixed-Signal Circuits
18:30
Jiaqi Wang1,
Georges Gielen2
1 KU Leuven
; 2 MICAS-KU Leuven
Analog/Mixed-Signal (AMS) circuit design remains a significant bottleneck in the development of modern electronic systems. While digital design automation has matured significantly, analog design is still largely manual, relying heavily on the intuition of expert designers to navigate vast combinatorial search spaces. As system requirements for bandwidth and power efficiency become more demanding, traditional optimization methods (such as genetic algorithms) struggle with sample efficiency, and manual topology selection becomes increasingly intractable.
My PhD research addresses these challenges by developing a comprehensive Artificial Intelligence (AI) framework for the system-level automation of AMS circuits. The work bridges the gap between optimization and synthesis through two core methodologies:
Graph-Guided Reinforcement Learning (RL): Utilizing Graph Attention Networks (GATs) to optimize circuit parameters and enable knowledge transfer between dissimilar architectures.
Generative AI for Topology Synthesis: Leveraging Large Language Models (LLMs) to automate the structural synthesis of circuits.
-
YPP04.12
High-Level Synthesis of speculative circuits
18:30
Dylan Leothaud1
1 Univ Rennes, IRISA
State-of-the-art High-Level Synthesis (HLS) tools offer an easy way to rapidly prototype and automatically generate hardware from a C or C++ behavioral description. Circuits generated with HLS tools may underperform due to static scheduling, which determines when each operation will be executed but uses only compile-time information and therefore assumes the worst case. To overcome the limitations of static scheduling, we develop SpecHLS, a speculative HLS tool that extends a state-of-the-art HLS tool to insert speculation mechanisms into the generated circuits. My PhD addresses three key questions that SpecHLS left open: (1) how and where to speculate in the generated circuit; (2) how to reduce the area overhead induced by the speculation mechanisms; and (3) how to take resource conflicts into account in the speculative circuits.
-
YPP04.13
Real-Time Emotion Recognition: Deep Physiological Feature Learning and Adaptive Edge Intelligence
18:30
Junjiao Sun1
1 Centro de Electrónica Industrial, Universidad Politecnica de Madrid
Affective computing seeks to equip intelligent systems with the ability to perceive and respond to human emotions. Physiological signals acquired from wearable sensors—such as BVP, EDA/GSR, and SKT—offer a non-intrusive and privacy-preserving source of information for real-time affect monitoring. However, accurate emotion recognition from physiological data remains challenging due to cross-user variability, limited labeled datasets, and the computational and privacy constraints inherent to edge devices.
This doctoral research develops a deep learning framework for non-intrusive physiological emotion recognition, with emphasis on structured feature representations, temporal modeling, personalization, and resource-efficient deployment. First, multi-channel physiological signals are transformed into two-dimensional feature maps, enabling the use of image-based neural architectures that better capture spatial and inter-signal relationships than traditional handcrafted approaches. Building on this representation, a hybrid CNN–LSTM model is proposed to jointly learn spatial descriptors and temporal dynamics, improving robustness and accuracy under real-world constraints.
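A minimal PyTorch sketch of such a hybrid CNN-LSTM, assuming illustrative shapes (windows of 2-D physiological feature maps in, emotion-class logits out); the layer sizes are placeholders rather than the thesis architecture.

    import torch
    import torch.nn as nn

    class CnnLstm(nn.Module):
        def __init__(self, n_classes=3, hidden=64):
            super().__init__()
            self.cnn = nn.Sequential(                   # spatial / inter-signal features
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.lstm = nn.LSTM(input_size=32 * 4 * 4, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):                           # x: (batch, time, 1, H, W)
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1))           # fold time into batch for the CNN
            feats = feats.flatten(1).view(b, t, -1)
            out, _ = self.lstm(feats)                   # temporal dynamics
            return self.head(out[:, -1])                # classify from the last time step

    logits = CnnLstm()(torch.randn(8, 10, 1, 32, 32))   # -> shape (8, 3)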
To address cold-start personalization—where new users provide no labeled data yet exhibit unique physiological patterns—this work introduces CLEAR, an adaptive clustering and fine-tuning method that enables immediate personalized inference with minimal supervision while remaining fully compatible with embedded hardware. Finally, CHEER, a self-supervised and transfer-efficient cloud-to-edge framework, is developed to overcome data scarcity and hardware limitations. CHEER pre-trains lightweight GCN-based encoders using unlabeled cloud data, assigns new users through clustering, and trains only a compact on-device classifier with a few labeled samples, significantly reducing computation, memory, and energy usage while keeping raw data on-device—an essential property when dealing with sensitive physiological information.
-
YPP04.14
Co-Design of Lightweight Intrusion Detection Systems for IoT Devices.
18:30
Qingyu Zeng1,
Yuko Hara2
1 Institute of Science Tokyo
; 2 CNRS
The proliferation of Internet-of-Things (IoT) devices in smart homes and cyber-physical systems enlarges the attack surface and calls for intrusion detection that can run directly on resource-constrained edge devices. Existing intrusion detection systems (IDSs) are typically designed for servers and rely on heavyweight models and large memories, which makes them difficult to deploy in realistic IoT environments. This work targets the algorithm/architecture co-design of lightweight IDSs and IoT edge platforms. On the algorithm side, we design a compact feature representation and a lightweight detection model that can operate within tight memory and computation budgets. On the architecture side, we implement and optimize the runtime on embedded IoT platforms while preserving runtime performance and keeping the computing overhead minimal.
-
YPP04.15
Towards Autonomous Analog IC Design: Sample-Efficient, PVT-Robust, and Explainable Sizing via Agentic AI
18:30
Mohsen Ahmadzadeh1,
Georges Gielen1
1 KU Leuven
Analog and Mixed-Signal (AMS) circuit sizing remains a critical bottleneck in modern System-on-Chip design. As technology scales to advanced FinFET nodes, the manual effort required to satisfy strict performance constraints across Process, Voltage, and Temperature (PVT) variations becomes prohibitive. While Reinforcement Learning (RL) has emerged as a promising automation technique, current methods suffer from two fundamental flaws: extreme sample inefficiency and a "black-box" nature that prevents designer trust.
This research presents a holistic framework to resolve these challenges through two novel contributions. First, to address efficiency and robustness, we introduce AnaCraft, a "Duel-Play" adversarial RL algorithm. Unlike standard cooperative methods, AnaCraft trains sizing agents to compete against an adversarial PVT agent that actively searches for worst-case corners. Coupled with probabilistic Model-Based Policy Optimization (MBPO), this approach achieves fully PVT-robust designs with ~3x fewer simulations than state-of-the-art methods, as validated on a complex 7nm data receiver.
Second, to bridge the interpretability gap, we present AnaFlow, the first agentic Large Language Model (LLM) workflow for analog sizing. By orchestrating specialized agents, AnaFlow mimics the cognitive workflow of a human expert. It performs iterative refinement using text-based reasoning rather than opaque numerical gradients. Experimental results demonstrate that AnaFlow matches the optimality of RL baselines while providing transparent, human-readable justifications for design trade-offs. Together, these works pave the way for autonomous design agents that are both mathematically robust and intelligible to engineers.
-
YPP04.16
Multi-objective Quantitative Modeling for Design Space Exploration on AI processors
18:30
Jiacong Sun1
1 KU Leuven
The rapid evolution of AI algorithms, combined with the slow design cycles of hardware, creates a critical need for fast and accurate design space exploration tools for AI accelerators. This work presents an analytical simulation framework that enables multi-objective design space exploration across a wide range of hardware architectures and workloads. The framework is structured around three interrelated research directions. First, we develop objective cost models that quantify performance, carbon footprint, and sparsity-aware behavior, validated against silicon measurements with less than 7% mismatch. Second, we develop a dataflow mapper that finds optimal scheduling for both deterministic and uncertainty-aware sparsity scenarios. Third, we develop an algorithm abstraction frontend that generalizes across diverse workloads—from CNNs and Transformers to emerging models such as diffusion models and Ising machines. All components are released as open-source tools.
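As a much-simplified illustration of an analytical cost model, the sketch below estimates per-layer latency as a roofline-style maximum of compute and memory time with a linear energy model; all constants are illustrative placeholders, far coarser than the silicon-validated models in the framework.

    def layer_cost(macs, bytes_moved, n_pes=256, freq_hz=500e6,
                   dram_bw=25.6e9, e_mac=1e-12, e_byte=20e-12):
        t_compute = macs / (n_pes * freq_hz)       # seconds, fully utilized PE array
        t_memory = bytes_moved / dram_bw           # seconds, DRAM-bandwidth bound
        latency = max(t_compute, t_memory)         # roofline: slower of the two dominates
        energy = macs * e_mac + bytes_moved * e_byte
        return latency, energy

    # 3x3 convolution, 224x224x64 output, 64 input channels
    macs = 224 * 224 * 64 * 64 * 3 * 3
    lat, en = layer_cost(macs, bytes_moved=50e6)
    print(f"latency ~{lat*1e3:.2f} ms, energy ~{en*1e3:.2f} mJ")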
-
YPP04.17
Exploiting Neuromorphic Computing for Effective, Efficient, and Secure Edge-Cloud Collaborative Computing System
18:30
Haomin Li1,
Li Jiang1,
Fangxin Liu2
1 Shanghai Jiao Tong University
; 2 Shanghai Jiaotong University
This research proposes an edge-cloud collaborative framework leveraging neuromorphic computing to address challenges in computational efficiency, scalability, and security for edge-cloud systems. The framework integrates four key components: (1) an efficient neuromorphic learning method, (2) a federated learning framework tailored for neuromorphic models, (3) robust security mechanisms and architectures for edge and cloud, and (4) an efficient neuromorphic computing framework optimized for edge devices.
These components work synergistically: the federated learning framework orchestrates edge-cloud collaboration, the efficient neuromorphic model ensures high performance with low computational cost in the cloud, the computational framework optimizes edge inference, and advanced cryptographic and hardware-based protections secure the system.
-
YPP04.18
Design Tools for Adiabatic Superconducting Logic Circuits Toward Energy-Efficient Computing
18:30
Rongliang Fu1
1 The Chinese University of Hong Kong
Adiabatic Quantum-Flux-Parametron (AQFP) logic is a highly energy-efficient superconductor logic family that reduces dynamic energy dissipation through adiabatic switching driven by AC excitation currents. It overcomes the power limitations of traditional logic families like rapid-single-flux-quantum (RSFQ) and efficiently implements majority functions. However, its unique features, such as fan-out limitations and clock-synchronized data propagation, make circuit design more complex. Addressing these challenges requires advanced design automation. I have contributed to developing EDA tools for AQFP logic, focusing on logic synthesis, placement, and routing, to streamline circuit design, optimize layouts, and reduce energy consumption, moving AQFP logic closer to practical application.
-
YPP04.19
Agentic on Cybersecurity and LLM-Aided EDA: Benchmarks and Automated Systems
18:30
Minghao Shao1
1 Graduate Research Assistant
This thesis investigates how modern AI systems, particularly domain-specialized agentic systems and large language models (LLMs), can be designed, evaluated, and deployed across diverse domains. Although areas such as cybersecurity, electronic design automation (EDA), quantum computing, and machine learning security differ in their technical requirements, they share common challenges including the need for domain-aware datasets, reliable benchmarks, automated reasoning pipelines, and robust evaluation frameworks. Motivated by these shared challenges, this Ph.D. research advances AI system applications in four sections: (1) cybersecurity agentic systems, (2) AI-aided EDA, (3) LLM for quantum code generation, and (4) machine learning security covering jailbreak attacks and deepfake detection.
-
YPP04.20
Bridging the Gap Between Fine-Grained Architectures and Large Systolic Arrays
18:30
Andrea Belano1,
Francesco Conti1
1 University of Bologna
The wide variety of architectures found in modern neural networks and their rapid evolution highlight the need for a platform that is both performant with large matrices and capable of efficiently parallelizing smaller, distinct tasks.
For these reasons, we propose NAUSICAA (Neural Acceleration Unit for Scalable Integration and Configurable Adaptive Architecture), a multi-tile architecture that combines the flexibility of a fine-grained design with the high arithmetic intensity of large systolic arrays.
Through a hierarchical design coupling RISC-V cores with dedicated systolic arrays and a private memory, NAUSICAA enables near-zero-overhead pipelining of linear and non-linear layers. Furthermore, the local multicast network of each tile allows the hardware to dynamically adapt to the workload, delivering consistently high performance and efficiency across different kernels.
-
YPP04.21
Design Technology Co-Optimization of Emerging Storage Class Memories
18:30
Bowen Wang1,
Wim Dehaene2
1 imec & KU Leuven
; 2 KU Leuven
This thesis advances emerging memory technologies to overcome the energy and density limitations of conventional memories and to enable their efficient integration into modern computing systems. The research pursues two primary objectives: 1) developing compact modeling and design methodologies that bring non-volatile memories into standard very-large-scale integration (VLSI) circuit and computer architecture design flows; 2) demonstrating low-power, high-density memory solutions for bandwidth-intensive artificial intelligence (AI) accelerators. These objectives are addressed through two core research thrusts that structure the main technical chapters of this dissertation.
The first research enables a comprehensive array-level evaluation of power, performance, and area (PPA) within a design-technology co-optimization (DTCO) framework, for voltage-controlled magnetic anisotropy (VCMA) magnetoresistive random-access memory (MRAM).
The second research investigates a three-dimensional (3D) indium gallium zinc oxide (IGZO) charge-coupled device (CCD) block memory as an on-chip storage solution for AI accelerators, with its PPA and architectural benefits evaluated using both DTCO and system-technology co-optimization (STCO) methodologies.
Together, these contributions provide a unified design-enablement flow and a novel memory architecture, addressing key challenges in energy efficiency, density, bandwidth, and cross-layer technology-to-system integration for future high-performance and energy-efficient AI systems.
-
YPP04.22
Model and System Co-design for Distributed DNN Inference on Edge Heterogeneous Devices
18:30
Mingyu Hu1,
Amit Kumar Singh2,
Jonathon Hare1,
Geoff Merrett1
1 University of Southampton
; 2 University of Essex
The growing deployment of deep neural networks on edge and IoT platforms requires efficient and reliable inference in heterogeneous, resource-constrained, and time-varying edge environments. Distributed DNN inference within local edge networks can preserve full model functionality, but it also introduces challenges in workload partitioning, communication overhead, and robustness to device dynamics and failures. This work investigates model–system co-design to address these challenges. We first propose HyPerEdge, a framework that characterises nonlinear and diverse computational performance and employs automated hybrid partitioning that jointly optimises inter-layer and intra-layer partitioning, achieving substantial latency and energy reductions on real edge hardware. To enhance reliability and flexibility, we introduce Fluid Dynamic DNNs, which use nested incremental training to build modular sub-networks capable of performing standalone inference or collaborating with other sub-networks to reconstruct a larger model, thereby enabling adaptive distributed inference and maintaining robustness under device failures. Finally, we present adaptive ensembles of Dynamic DNNs that integrate multi-device scheduling with per-device width selection using a deadline-aware optimisation scheme, enabling fine-grained accuracy–latency trade-offs. Together, these methods provide a comprehensive solution for efficient, reliable, and adaptive distributed DNN inference on heterogeneous edge devices.
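A toy sketch of the inter-layer partitioning step for a two-device pipeline, choosing the cut that minimizes the slowest pipeline stage (compute on either device, or the activation transfer at the cut). All numbers and the two-device restriction are illustrative; HyPerEdge additionally explores intra-layer splits and larger device sets.

    def best_split(layer_flops, act_bytes, speed_a, speed_b, link_bps):
        """Return (stage_time, k): layers [0, k) on device A, the rest on device B."""
        n = len(layer_flops)
        best = (float("inf"), None)
        for k in range(n + 1):
            t_a = sum(layer_flops[:k]) / speed_a
            t_b = sum(layer_flops[k:]) / speed_b
            t_link = act_bytes[k - 1] / link_bps if 0 < k < n else 0.0
            stage = max(t_a, t_link, t_b)            # pipelined: throughput = 1 / stage
            best = min(best, (stage, k))
        return best

    flops = [4e8, 8e8, 8e8, 2e8]                     # per-layer workload (ops)
    acts = [6e5, 3e5, 1e5, 4e4]                      # activation bytes after each layer
    print(best_split(flops, acts, speed_a=8e9, speed_b=8e9, link_bps=1e9))  # -> (0.15, 2)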
-
YPP04.23
Ultra-low Latency and Extreme-throughput Neural Network Accelerators on FPGA
18:30
Atousa Jafari1,
Marco Platzner2
1 researcher
; 2 professor
The goal of this thesis is to study the feasibility, benefits, and limitations of mapping neural network models to FPGA platforms in the direct logic implementation style to achieve ultra-low latency, extreme throughput, and high energy efficiency. To reach this goal, novel and well-established techniques for pruning and quantization at the algorithmic level, as well as hardware-level approximations, are investigated. Through these methods, computational cost is reduced, and redundant connections are eliminated, addressing the scalability issue. In terms of network models, this PhD project focuses on both feed-forward (FFNN) and recurrent neural networks (RNN). In particular, the Echo State Network (ESN), a widely used form of Reservoir Computing (RC), is explored as a promising alternative to conventional RNNs, offering a simpler, less computationally intensive, and more hardware-efficient solution for time-series analysis. Experiments for recurrent networks are conducted on two major categories of tasks: (i) time-series classification (TSC) and (ii) time-series forecasting (regression).
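A minimal NumPy sketch of the Echo State Network state update assumed here: a fixed random reservoir whose states feed a trainable linear readout, which is what keeps ESNs cheap to map to hardware; sizes and scaling constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res = 3, 100
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))        # fixed input weights
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))           # fixed recurrent weights
    W *= 0.9 / max(abs(np.linalg.eigvals(W)))             # keep spectral radius below 1

    def run_reservoir(inputs):                            # inputs: (T, n_in)
        x = np.zeros(n_res)
        states = []
        for u in inputs:
            x = np.tanh(W_in @ u + W @ x)                  # reservoir state update
            states.append(x)
        return np.stack(states)                            # (T, n_res)

    states = run_reservoir(rng.standard_normal((50, n_in)))
    # Only a linear readout (e.g., ridge regression on `states`) is trained;
    # W_in and W stay fixed, so no backpropagation through time is needed.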
-
YPP04.24
Enabling Large-Scale RTL Simulation and Realistic System Integration
18:30
Guillem López-Paradís1,
Jonathan Balkind2,
Adrià Armejach3,
Miquel Moreto4
1 Barcelona Supercomputing Center
; 2 UC Santa Barbara
; 3 BSC & UPC
; 4 BSC
During the last decade, heterogeneous system-on-chip (SoC) architectures have become universally popular, lately being further enriched with custom accelerators. However, available tools to simulate low-level RTL designs often overlook the specific target system in which the design will eventually operate. This hinders proper testing and debugging of functionalities, and does not allow co-designing the accelerator to obtain a balanced and efficient architecture.
At the same time, despite the widespread adoption of highly parallel multicore processors, which currently offer up to hundreds of cores per chip, RTL simulation remains largely sequential. While most of the software domains take advantage of thread- and node-level parallelism, RTL simulators are usually restricted to a single node and only a handful of threads. This lack of parallelisation creates performance bottlenecks, especially as modern hardware designs grow in scale and complexity, ultimately slowing down verification and architectural exploration.
In this work, we present two open-source tools developed to address these challenges. First, Metro-MPI is a distributed parallel RTL simulation framework that leverages MPI to significantly accelerate simulation throughput across many cores and nodes. Second, gem5+RTL is a framework that integrates RTL modules directly inside the gem5 full-system simulation environment, enabling cycle-accurate accelerator modelling within realistic, OS-driven workloads. We conclude with remarks on future directions and on how these tools have been deployed to, e.g., accelerate design-space exploration studies.
-
YPP04.25
Architectural Defenses against Hardware Attacks on RISC-V cores
18:30
Songqiao Cui1,
Ingrid Verbauwhede2,
Josep Balasch3
1 KU Leuven
; 2 KU Leuven - COSIC
; 3 Rambus
Hardware attacks are powerful techniques to extract secret data from electronic devices. By monitoring measurable physical signals at run-time or by actively tampering with a device, an adversary can uncover sensitive application data. IoT and embedded devices are particularly susceptible to these attacks, as edge nodes are commonly assembled with general-purpose microcontrollers that lack any dedicated protection. While countermeasures exist, implementing them in software is far from optimal. A more promising approach is to integrate them at the hardware level. This PhD aims to design, implement and evaluate hardware architectural extensions that provide built-in resistance against hardware attacks. The research activities are centered around RISC-V, an open instruction set architecture that can be freely extended with application-specific modules. This PhD investigates how to prevent data leakages by using a combination of protected execution units and instruction randomization. Additionally, the architectural solutions are extended to detect intentional data tampering. The performance and security guarantees of the resulting architectures are evaluated and demonstrated using FPGA platforms.
-
YPP04.26
Deeply Understanding and Efficiently Mitigating DRAM Read Disturbance via New Testing Infrastructure and Comprehensive Real-Chip Experimental Studies
18:30
Ataberk Olgun1,
Onur Mutlu1
1 ETH Zurich
Read disturbance (e.g., RowHammer and RowPress) in modern DRAM chips is a widespread phenomenon and is reliably used for breaking memory isolation, a fundamental building block for building robust systems. DRAM chips are increasingly vulnerable to read disturbance phenomena due to DRAM technology scaling. Even though many prior works develop various read disturbance solutions, these solutions incur non-negligible and increasingly higher system performance, energy, and hardware area overheads as read disturbance worsens. This work advances the state of the art by enabling insightful experimental studies of modern DRAM chips via an easy-to-use FPGA-based infrastructure, deepening our understanding of read disturbance in cutting-edge High Bandwidth Memory DRAM chips via rigorous real-chip characterization, uncovering a new read disturbance phenomenon that shows how challenging it is to reliably determine the read disturbance susceptibility of a DRAM chip, and designing a new low-cost and low-overhead read disturbance solution.
-
YPP04.27
Application-Specific Hardware Optimization using Virtual Prototypes
18:30
Jan Zielasko1,
Rolf Drechsler2
1 Cyber-Physical Systems, DFKI GmbH
; 2 University of Bremen/DFKI
Identifying the optimal hardware configuration for running complex workloads such as Neural Network inference on ultra-low-power edge devices is critical for reducing cost and maximizing performance. Tailoring hardware designs to specific applications significantly increases resource utilization, which is essential to meet the strict performance and energy constraints. Unfortunately, exploring the design space at the hardware level is challenging due to the complexity and time-consuming nature of hardware design processes. In this PhD work, we present an analysis platform based on a RISC-V Virtual Prototype (VP) to systematically identify fine-grained hardware optimization opportunities.
The VP models the entire hardware platform while remaining fast and accessible. Combined with a custom execution-trace compression and analysis framework, it enables the capture and processing of billions of executed instructions.
Applied to a wide range of representative embedded and edge artificial intelligence workloads from the Embench and MLPerf Tiny benchmark suites, our approach successfully identifies promising optimization opportunities beyond the matrix multiplication kernel that are non-trivial to detect from either source-code or gate-level analysis.
-
YPP04.28
A New Era of EDA: A Gen-AI-Aided Framework for Hardware Development
18:30
Kaiyuan Yang1,
John Goodenough1,
Tiantai Deng1
1 The University of Sheffield
AI workloads need specialised accelerators with short time to market across cloud, edge, and terminal devices. However, conventional EDA is slow across abstraction levels, adds physical awareness late, and relies on manual system partitioning. My dissertation addresses these gaps with three elements: Natural-Level Synthesis (NLS), a new design stage from high-level intent to synthesis-ready RTL; AI-SoC, a framework for constraint-aware system partitioning; and XChip, a generative-AI-aided toolchain that carries natural-language system descriptions through these stages to RTL and GDSII with simple internal checks. I define an evaluation benchmark that combines Quality of Generated Hardware (QGH) from synthesis PPA with Required Design Effort (RDE) from prompt, code, and adjustment measures. Across a range of accelerator designs, the framework reduces design effort while maintaining or improving PPA and yields early partitions that better match real constraints.
-
YPP04.29
Side-Channel Awareness in Neural Network FPGA Accelerators: Security Threats and Opportunities for Functional Safety
18:30
Vincent Meyers1,
Mehdi Tahoori1
1 Karlsruhe Institute of Technology
Neural network accelerators on FPGAs are increasingly deployed in multi-tenant and resource-constrained environments, where data-dependent switching activity exposes them to physical side-channel leakage. This thesis analyzes how such leakage affects confidentiality and reliability in realistic FINN-generated accelerators and develops methods that exploit or mitigate these effects.
We show that accelerator implementation details such as folding significantly shape leakage characteristics. A naive profiled attack recovers layer sizes with only 44% accuracy, while our folding-aware method achieves 100%. We further introduce a generative attack that reconstructs input images from single power traces across devices and conditions and a Trojan-based training approach that boosts output-recovery accuracy by up to 33%.
On the defensive side, we propose side-channel-aware training using differentiable power models to suppress leakage and demonstrate reliable remote fingerprinting using an abstract power model. We also show that on-chip voltage sensors enable concurrent out-of-distribution and hardware-fault detection with negligible overhead.
Overall, the results highlight the dual role of side-channel information as both a security risk and a lightweight source of runtime safety signals.
-
YPP04.30
Towards Lightweight Authentication: PUF‑enabled Mutual Trust and Key-exchange for IoT
18:30
Chandranshu Gupta1,
Gaurav Varshney1
1 Indian Institute of Technology Jammu
The widespread adoption of the Internet of Things (IoT) has brought billions of constrained devices into security-critical environments, yet most cannot support the computation, storage or energy required for conventional cryptographic infrastructures. As a result, many deployments lack robust authentication and secure key establishment, creating significant vulnerabilities. Physical Unclonable Functions (PUFs) offer a promising foundation for addressing these limitations through hardware-embedded uniqueness that avoids storing long-term secrets. This dissertation develops a unified PUF-based framework to provide scalable and lightweight trust for IoT systems. It presents an offline authentication protocol for BLE and Zigbee devices using an Arbiter PUF, a certificate-less PUF-assisted public-key mechanism that eliminates stored certificates, and an SRAM-PUF logic-locking scheme that generates stable device keys for authentication and IP protection. Together, these contributions demonstrate a practical pathway for secure IoT operation without relying on heavyweight cryptography.
-
YPP04.31
Generative AI in the Hardware Design Flow: From High-Level Synthesis to Security
18:30
Luca Collini1,
Ramesh Karri2
1 NYU Tandon School of Engineering
; 2 NYU
Modern silicon technologies, such as sub-10 nm process nodes, 3D-stacked wafers, chiplet-based architectures, and in-package memory, have enabled increasingly energy-efficient and high-performance chips. These advances support new software applications and products, including large-scale AI workloads, real-time edge processing, and immersive augmented or virtual reality, which in turn drive greater demand for specialized hardware. At the same time, these technologies significantly increase design complexity and cost. Developing a chip from specification to prototype can require over $100 million and at least twelve months. Furthermore, as more computation moves to the edge and devices handle growing volumes of user data, security must be considered starting from the hardware level. Security validation and verification often fall on the critical path of the design process, potentially increasing time to market.
-
YPP04.32
Enhancing AI Systems Safety and Reliability
18:30
Vittorio Turco1,
Matteo Sonza Reorda1,
Annachiara Ruospo1
1 Politecnico di Torino
The widespread adoption of Deep Neural Networks (DNNs) in safety-critical domains demands rigorous reliability guarantees against random hardware faults. This research proposes robust approaches for on-line fault detection, providing a software-level and cost-effective alternative to standard reliability methodologies. The work first introduces early detection techniques based on tensor-related metrics applied to intermediate Output Feature Maps (OFMs), enabling the identification of faults before they corrupt the final output. This methodology is subsequently refined into a two-phase monitoring strategy for classification tasks, combining embedding analysis with on-demand Image Test Libraries (ITLs) to identify hard-to-detect faults. Expanding beyond classification, the research addresses complex semantic segmentation tasks by introducing the APSS (Area, Position, Symmetry, Shape) metrics, which perform real-time, "golden-free" output validation. Finally, to enable fair and reproducible assessments of these detection capabilities, a standardized Benchmark Suite has been established, providing a common ground for future reliability studies across different frameworks.
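As a simplified illustration of monitoring intermediate Output Feature Maps, the sketch below profiles per-layer value ranges on fault-free runs and flags inferences whose OFMs leave the profiled envelope; the actual methodology relies on richer tensor metrics, embedding analysis, and ITLs, so this only conveys the basic idea.

    import numpy as np

    class OfmMonitor:
        def __init__(self, margin=0.1):
            self.bounds, self.margin = {}, margin

        def profile(self, layer, ofm):               # call on fault-free (golden) runs
            lo, hi = float(ofm.min()), float(ofm.max())
            old_lo, old_hi = self.bounds.get(layer, (lo, hi))
            self.bounds[layer] = (min(lo, old_lo), max(hi, old_hi))

        def check(self, layer, ofm):                 # call on-line during inference
            lo, hi = self.bounds[layer]
            slack = (hi - lo) * self.margin          # tolerate small excursions
            return bool(lo - slack <= ofm.min() and ofm.max() <= hi + slack)

    mon = OfmMonitor()
    for _ in range(100):                             # profiling with golden OFMs
        mon.profile("conv1", np.random.randn(64, 32, 32))
    print(mon.check("conv1", np.random.randn(64, 32, 32)))        # True: plausible OFM
    print(mon.check("conv1", 1e6 * np.random.randn(64, 32, 32)))  # False: likely faulty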
-
YPP04.33
Leveraging Machine Learning in Physical Design Implementation and Verification
18:30
Pooja Beniwal1,
Sneh Saurabh2
1 Indraprastha Institute of Information Technology Delhi (IIIT-Delhi)
; 2 Indraprastha Institute of Information Technology
Electronic design automation (EDA) comprises the software tools used to design modern integrated circuits, covering synthesis, placement, routing, timing analysis, and verification. As designs scale to billions of transistors, challenges arise from increased complexity, tighter timing margins, higher power-performance-area (PPA) demands, and long runtimes. Machine learning (ML) addresses these issues by learning complex patterns from large design datasets, enabling faster inference and improved optimization. Motivated by these advantages, this research integrates ML into static timing analysis (STA) and VLSI physical design to improve accuracy, reduce pessimism and design effort, and enable more scalable design flows.
-
YPP04.34
On Cutwidth: Polynomial-Time Formal Verification, Debugging, and Correction for Circuits
18:30
Mohamed Nadeem1,
Rolf Drechsler2
1 University of Bremen
; 2 University of Bremen/DFKI
Ensuring the correct functionality of digital circuits is a critical focus for both academic and industrial research. Formal Verification (FV) is a well-established technique for achieving complete circuit correctness. However, as circuit complexity grows, verification methods face significant challenges in establishing resource bounds, making verification impractical for many designs. To address this, Polynomial Formal Verification (PFV) has been introduced to provide bounded time and space for verification.
In our research, we proposed Linear-Time Formal Verification (LFV) as a subclass of PFV for circuits with constant Cut Width (CW) and proved that verification of combinational and sequential designs can be performed in linear time for both exact and approximate configurations in binary and Multi-Valued Logic (MVL) domains, improving on previously known higher-degree polynomial bounds. We further defined Polynomial Debugging and Fault Correction (PDFC), a subclass of Debugging and Fault Correction (DFC) for circuits with constant CW, and showed that it could be solved in polynomial time and space, thereby enabling efficient debugging and related Electronic Design Automation (EDA) tasks that had not been tackled before.
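For reference, the sketch below computes the cutwidth of a circuit graph under a given linear order of its gates, i.e., the largest number of wires crossing any gap between consecutive positions; the ripple-carry-style example keeps this value constant as the chain grows, which is the structural property the linear-time results rely on. The example is illustrative and not taken from the thesis.

    def cutwidth(order, edges):
        pos = {g: i for i, g in enumerate(order)}
        width = 0
        for gap in range(len(order) - 1):            # gap between positions gap and gap+1
            crossing = sum(1 for (u, v) in edges
                           if min(pos[u], pos[v]) <= gap < max(pos[u], pos[v]))
            width = max(width, crossing)
        return width

    # ripple-carry-style chain: the cut stays narrow in this natural order
    edges = [("a0", "s0"), ("b0", "s0"), ("s0", "s1"), ("a1", "s1"), ("b1", "s1")]
    print(cutwidth(["a0", "b0", "s0", "a1", "b1", "s1"], edges))   # -> 3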
-
YPP04.35
Design techniques for AI-enabled embedded systems
18:30
Giovanni Pollo1,
Enrico Macii1,
Daniele Jahier Pagliari1,
Sara Vinco1,
Alessio Burrello2
1 Politecnico di Torino
; 2 Politecnico di Torino and Università di Bologna
Designing AI-enabled embedded systems is challenging due to tight resource budgets and complex interactions between hardware and software. On the one hand, on-device AI models must meet strict memory, latency, and energy constraints while delivering accurate real-time decisions, which requires dedicated model optimization techniques. On the other hand, accurate system-level modeling is essential to capture closed-loop interactions between algorithms, hardware, and the surrounding environment, and to assess how model-level choices impact end-to-end system behavior. The purpose of this PhD thesis is to support the design of embedded AI systems in complex domains such as biosignal processing and robotics. The first main contribution is an automated Deep Neural Network (DNN) optimization flow, tailored for embedded deployment, which combines Neural Architecture Search (NAS), structured pruning, and quantization to explore accuracy-efficiency trade-offs under tight constraints. The second contribution is a Virtual Prototyping (VP) framework that offers a detailed virtual representation of the target platform, enabling Design-Space Exploration (DSE) of hardware and software options.
-
YPP04.36
AI-Powered Resilient Perception for Autonomous Systems: Lightweight Near-Sensor Point Cloud Corruption Detection
18:30
Grafika Jati1,
Martin Molan1,
Francesco Barchi2,
Andrea Acquaviva1
1 University of Bologna
; 2 Università di Bologna
LiDAR is a critical sensor for autonomous vehicles, helping them detect and understand their surroundings. But what happens when the sensor itself is compromised? In real-world driving, LiDARs can be affected by everyday contaminants like water, mud, dust, or engine oil. These substances can distort the data, causing the vehicle to miss objects—or worse, see things that aren't there—with dangerous levels of confidence. Most current AI models are trained in ideal, clean conditions and are not prepared for these real-world challenges.
Our research tackles this problem by creating the first dataset of real-world corrupted point cloud data affected by physical contamination. We also develop an intelligent safety layer that detects when LiDAR data is unreliable—before it affects critical decisions like braking or turning. Designed to run on lightweight edge devices, our system is fast, efficient, and adaptable. In future work, we aim to make the system smarter through few-shot learning and uncertainty-based decision support, helping autonomous vehicles become safer and more trustworthy in the messy, unpredictable real world. Data and code will be available at: https://gitlab.com/ecs-lab/lidaroc, https://gitlab.com/ecs-lab/anzil, and https://gitlab.com/ecs-lab/distilling-pointcloud-corruption.
-
YPP04.37
Efficient Spiking Neural Networks for Edge-Based Auditory Perception
18:30
Shreya Kshirasagar1,
Christian Mayr2
1 Robert Bosch GmbH (Bosch Research)
; 2 TU Dresden
The increasing demand for real-time auditory perception in resource-constrained environments, such as automotive edge devices, necessitates the development of efficient and robust neural network architectures. This PhD research focuses on leveraging Spiking Neural Networks (SNNs) to address the challenges of auditory perception, specifically targeting the detection of siren sounds in noisy environments. The proposed SpikeSireNet architecture demonstrates comparable accuracy while significantly reducing model size and computational requirements compared to conventional Recurrent Neural Networks (RNNs).