ACM Transactions on Design Automation of Electronic Systems (TODAES)

Latest Articles

Quality-Enhanced OLED Power Savings on Mobile Devices

In the future, mobile systems will increasingly feature more advanced organic light-emitting diode (OLED) displays. The power consumption of these...

Switching Predictive Control Using Reconfigurable State-Based Model

Advanced control methodologies have helped the development of modern vehicles that are capable of path planning and path following. For instance,...

NEWS

ACM TODAES new page limit policy: Manuscripts must be formatted in the ACM Transactions format; a 35-page limit applies to the final paper. Rare exceptions are possible if recommended by the reviewers and approved by the Editorial Board.

ORCID is a community-based effort to create a global registry of unique researcher identifiers for the purpose of ensuring proper attribution of works to their creators. When you submit a manuscript for review, you will be presented with the opportunity to register for your ORCID.

Welcome ACM Associate Editors

Forthcoming Articles
Data-driven Anomaly Detection with Timing Features for Embedded Systems

Malware is a serious threat to network-connected embedded systems, as evidenced by the continued and rapid growth of such devices, commonly referred to as the Internet of Things. Their ubiquitous use in critical applications requires robust protection to ensure user safety and privacy. That protection must be applied to all system aspects, extending beyond protecting the network and external interfaces. Anomaly detection is one of the last lines of defense against malware, and data-driven approaches are popular because they require the least domain knowledge. However, embedded systems, particularly edge devices, face several challenges in applying data-driven anomaly detection, including the unpredictability of malware, limited tolerance to long data collection windows, and limited computing/energy resources. In this paper, we utilize subcomponent timing information of software execution, including intrinsic software execution time, instruction cache misses, and data cache misses, as features to detect anomalies based on ranges, multidimensional Euclidean distance, and classification at runtime. Detection methods based on lumped timing ranges are also evaluated and compared. We design several hardware detectors implementing these data-driven detection methods, which non-intrusively measure the lumped/subcomponent timing of all system/function calls of the embedded application. We evaluate the area, power, and detection latency of the presented detector designs. Experimental results demonstrate that the subcomponent timing model provides sufficient features to achieve high detection accuracy with low false positive rates using a one-class support vector machine, even when considering sophisticated mimicry malware.
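
As a rough illustration of the range-based and distance-based checks described above, the following Python sketch flags a call's subcomponent timings (execution cycles, instruction cache miss cycles, data cache miss cycles) as anomalous when any feature falls outside the range observed during training, or when the sample's Euclidean distance to the nearest training point exceeds a threshold. The feature values, thresholds, and function names are hypothetical; the paper's detectors are realized in hardware, not software.

    import math

    # Hypothetical training data: per-call subcomponent timings
    # (exec_cycles, icache_miss_cycles, dcache_miss_cycles)
    train = [(1200, 85, 40), (1180, 80, 42), (1225, 90, 38)]

    # Range-based model: per-feature (min, max) observed during training
    ranges = [(min(s[i] for s in train), max(s[i] for s in train)) for i in range(3)]

    def range_anomalous(sample):
        """Flag if any subcomponent timing falls outside its trained range."""
        return any(not (lo <= v <= hi) for v, (lo, hi) in zip(sample, ranges))

    def distance_anomalous(sample, threshold=50.0):
        """Flag if the sample is far (Euclidean) from every training point."""
        nearest = min(math.dist(sample, s) for s in train)
        return nearest > threshold

    sample = (1600, 140, 95)   # e.g. timing inflated by injected malware
    print(range_anomalous(sample), distance_anomalous(sample))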

Boundary-Functional Broadside and Skewed-Load Tests

Close-to-functional broadside tests are used for avoiding overtesting of delay faults that can result from non-functional operation conditions, while avoiding test escapes because of faults that cannot be detected under functional operation conditions. When a close-to-functional broadside test deviates from functional operation conditions, the deviation can affect the entire circuit. This paper defines the concept of a boundary-functional broadside test where non-functional operation conditions are prevented from crossing a preselected boundary. Using the procedure described in this paper, the boundary maintains the same values under a boundary-functional broadside test as under a functional broadside test from which it is derived. Indirectly, this ensures that the deviations from functional operation conditions throughout the entire circuit are limited. The concept of a boundary-functional broadside test is extended to skewed-load tests, and to partial-boundary-functional tests. Experimental results are presented for benchmark circuits to demonstrate the fault coverage improvements that can be achieved using boundary-functional broadside and skewed-load tests as well as partial-boundary-functional tests of both types.

Programmable Gates Using Hybrid CMOS-STT Design to Prevent IC Reverse Engineering

This paper presents a rigorous step towards design-for-assurance by introducing a new class of logically reconfigurable designs that are resilient to design reverse engineering. Based on non-volatile spin transfer torque (STT) magnetic technology, we introduce a basic set of non-volatile reconfigurable Look-Up-Table (LUT) logic components (NV-STT-based LUTs). STT-based LUTs, with a significantly different set of characteristics compared to CMOS, provide new opportunities to enhance design security, yet make it challenging to remain highly competitive with custom CMOS, or even SRAM-based LUTs, in terms of power, performance, and area. To address these challenges, we propose several algorithms to select and replace custom CMOS gates with reconfigurable STT-based LUTs during design implementation such that the functionality of the STT-based components, and therefore of the entire design, cannot be determined in any manageable time, rendering any design reverse engineering attack ineffective. Our study, conducted on a large number of standard circuit benchmarks, demonstrates significant resiliency of hybrid STT-CMOS circuits against various types of attacks. Furthermore, the selection algorithms on average have a small impact on the performance of the circuit. We also tested these techniques against recently developed satisfiability attacks and show that they render such advanced reverse-engineering techniques computationally infeasible as well.

Optimal Allocation of Computation and Communication in an IoT Network

The Internet of Things (IoT) is being developed for a wide range of applications, from home automation and personal fitness to smart cities. The extensive growth in the adoption of IoT devices has been accompanied by uncoordinated and substandard designs aimed at promptly making products available to the end consumer. This substandard approach restricts the growth of IoT in the near future and necessitates studies to understand the requirements for an efficient design. A particular area where IoT applications have grown significantly is surveillance and monitoring. IoT applications in this domain rely on distributed, battery-powered sensors capable of collecting images, processing them, and communicating the raw or processed data to the nearest node until it reaches the base station for decision making. In such an IoT network, where processing can be distributed over the network, the important research question is how much data each node should process and how much it should communicate for a given objective. This work answers this question and provides a deeper understanding of energy and delay trade-offs in an IoT network under three different target metrics.
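
The allocation question can be made concrete with a toy energy model: if a node processes a fraction x of its raw data locally (shrinking it by a compression factor) and forwards the rest untouched, the total energy is a weighted sum of processing and transmission costs. The sketch below sweeps x to find the minimum-energy split; all constants and names are invented for illustration and are not taken from the paper.

    # Toy model: node holds D bits of raw sensor data.
    # Processing a bit costs e_proc J and shrinks it by factor r;
    # transmitting a bit to the next hop costs e_tx J.
    D, e_proc, e_tx, r = 1e6, 5e-9, 80e-9, 0.1

    def energy(x):
        """Energy when a fraction x of the data is processed locally."""
        processed = x * D                             # bits processed on the node
        forwarded = processed * r + (1 - x) * D       # compressed + raw bits sent
        return processed * e_proc + forwarded * e_tx

    best = min((energy(x / 100), x / 100) for x in range(101))
    print(f"min energy {best[0]:.4e} J at local-processing fraction {best[1]:.2f}")

With a purely linear model like this, the optimum always lands at one extreme; adding a delay constraint or a per-hop relay cost is what makes intermediate splits attractive.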

SynergyFlow: An Elastic Accelerator Architecture Supporting Batch Processing of Large-Scale Deep Neural Networks

Neural networks (NNs) have achieved great success in a broad range of applications. As NN-based methods are often both computation and memory intensive, accelerator solutions have proved highly promising in terms of both performance and energy efficiency. Although prior solutions can deliver high computational throughput for convolutional layers, they can incur severe performance degradation when accommodating the entire network model, because computing and memory bandwidth requirements differ greatly between convolutional layers and fully-connected layers, and furthermore among different NN models. To overcome this problem, we propose an elastic accelerator architecture, called SynergyFlow, which intrinsically supports layer-level and model-level parallelism for large-scale deep neural networks. Our design boosts resource utilization by exploiting the complementary resource demands of different layers and different NN models. SynergyFlow can dynamically reconfigure itself according to the workload characteristics, maintaining high performance and high resource utilization across various models. As a case study, we implement SynergyFlow on a P395-AB FPGA board. Under a 100MHz working frequency, our implementation improves performance by 33.8% on average (up to 67.2% on AlexNet) compared to comparably provisioned previous architectures.

Harvesting Row-Buffer Hits via Orchestrated Last-Level Cache and DRAM Scheduling for Heterogeneous Multicore Systems

In heterogeneous multicore systems, the memory subsystem, including the last-level cache and DRAM, is widely shared among the CPU, the GPU, and the real-time cores. Due to their distinct memory traffic patterns, heterogeneous cores cause more frequent cache misses at the last-level cache. As cache misses travel through the memory subsystem, two schedulers are involved, one for the last-level cache and one for DRAM. Prior studies treated the scheduling of the last-level cache and DRAM as independent stages. However, with no orchestration and limited visibility of memory traffic, neither scheduling stage is able to ensure optimal scheduling decisions for memory efficiency. Unnecessary precharges and row activations happen in DRAM when the memory scheduler is ignorant of incoming cache misses and DRAM row-buffer states are invisible to the last-level cache. In this paper, we propose a unified memory controller for the last-level cache and DRAM with orchestrated schedulers. The memory scheduler harvests row-buffer hit opportunities in cache request buffers during spare time without inducing significant implementation cost. Extensive evaluations show that the proposed controller improves the total memory bandwidth of DRAM by 16.8% on average and saves DRAM energy by up to 29.7% while achieving comparable CPU IPC. In addition, we explore the potential of the proposed memory controller to attain improvements in both memory bandwidth and CPU IPC.
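
A minimal sketch of the orchestration idea, assuming the DRAM scheduler can peek at pending last-level-cache miss/writeback entries: whenever a queued request targets the currently open DRAM row, it is promoted so the open row is reused before any precharge is issued. The class and field names below are illustrative only and do not reflect the proposed controller's actual microarchitecture.

    from collections import deque

    class OrchestratedScheduler:
        """Toy model: one DRAM bank with an open-row register, plus visibility
        into the LLC request buffer so row-buffer hits can be harvested."""

        def __init__(self):
            self.open_row = None
            self.llc_requests = deque()   # pending misses/writebacks: (addr, row)

        def enqueue(self, addr, row):
            self.llc_requests.append((addr, row))

        def next_request(self):
            # Prefer any queued request that hits the currently open row.
            for i, (addr, row) in enumerate(self.llc_requests):
                if row == self.open_row:
                    del self.llc_requests[i]
                    return addr, "row-hit"
            if not self.llc_requests:
                return None
            addr, row = self.llc_requests.popleft()   # oldest-first otherwise
            self.open_row = row                       # precharge + activate
            return addr, "row-miss"

    s = OrchestratedScheduler()
    s.enqueue(0x100, row=7); s.enqueue(0x200, row=3); s.enqueue(0x180, row=7)
    s.open_row = 7
    print([s.next_request() for _ in range(3)])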

Knowledge and Simulation Based Synthesis of Area-Efficient Passive Loop Filter Incremental Zoom-ADC for Built-In Self-Test Applications

We propose a passive, fully-differential, synthesizable zoom-ADC architecture for BIST applications, along with a synthesis tool that can target various design specifications. We present the detailed ADC architecture and a step-by-step process for designing the zoom-ADC. The design flow does not rely on the extensive knowledge of an experienced ADC designer. Two ADCs have been synthesized with different performance requirements in a 65nm CMOS process. The first ADC achieves 91dB SNR in a 512 µs measurement time and consumes 14.95 µW of power. The second design achieves 78.2dB SNR in a 31.25 µs measurement time and consumes 60 µW of power.

ERASMUS: Efficient Remote Attestation via Self-Measurement for Unattended Settings

Remote attestation (RA) is a popular means of detecting malware in embedded and IoT devices. RA is usually realized as an interactive protocol, whereby a trusted party (the verifier) measures the integrity of a potentially compromised remote device (the prover). Early work focused on purely software-based and fully hardware-based techniques, neither of which is ideal for low-end devices. More recent results have yielded hybrid (SW/HW) security architectures comprising a minimal set of features to support efficient and secure RA on low-end devices. All prior RA techniques require on-demand operation, i.e., RA is performed in real time. We identify some drawbacks of this general approach in the context of unattended devices: First, it fails to detect mobile malware that enters and leaves the prover between successive RA instances. Second, it requires the prover to engage in a potentially expensive (in terms of time and energy) computation, which can be harmful for critical or real-time devices. To address these drawbacks, we introduce the concept of self-measurement, whereby a prover device periodically (and securely) measures and records its own software state, based on a pre-established schedule. A possibly untrusted verifier occasionally collects and verifies these measurements. We present the design of a concrete technique called ERASMUS: Efficient Remote Attestation via Self-Measurement for Unattended Settings, justify its features, and evaluate its performance. In the process, we also define a new metric, Quality of Attestation (QoA). We argue that ERASMUS is well-suited for time-sensitive and/or safety-critical applications that are not served well by on-demand RA. Finally, we show that ERASMUS is a promising stepping stone towards handling attestation of multiple devices (i.e., a group or swarm) with high mobility.
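
The self-measurement idea can be illustrated in a few lines of Python: at each scheduled instant the prover computes a keyed digest of its (simulated) software state and appends it, with a timestamp, to a log that a verifier later collects and checks. The key handling, schedule, and state capture below are placeholders; the actual design anchors the key and the measurement code in a hybrid HW/SW architecture.

    import hmac, hashlib, time

    ATTEST_KEY = b"pre-shared-demo-key"      # placeholder; a real key lives in HW

    def measure(software_state: bytes) -> dict:
        """One self-measurement: keyed digest of the current software state."""
        tag = hmac.new(ATTEST_KEY, software_state, hashlib.sha256).hexdigest()
        return {"t": time.time(), "digest": tag}

    def verify(log, expected_state: bytes) -> bool:
        """Verifier recomputes the digest of the known-good state for each entry."""
        good = hmac.new(ATTEST_KEY, expected_state, hashlib.sha256).hexdigest()
        return all(hmac.compare_digest(e["digest"], good) for e in log)

    log = [measure(b"firmware-v1")]          # periodic, per pre-set schedule
    log.append(measure(b"firmware-v1-INFECTED"))
    print(verify(log, b"firmware-v1"))       # False: a measurement deviates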

Reducing Writebacks Through In-Cache Displacement

Non-Volatile Memory (NVM) technology is a promising solution to fulfill the ever-growing need for higher capacity in the main memory of modern systems. Despite its many great features, NVM's poor write performance remains a severe obstacle, preventing it from being used as a DRAM alternative in the main memory. Most of the prior work targeted optimizing writes at the main memory side and neglected the decisive role of upper-level cache management policies in reducing the number of writes. In this paper, we propose a novel cache management policy that attempts to maximize write-coalescing in the on-chip SRAM last-level cache (LLC), for the sake of reducing the number of costly writes to the off-chip NVM. We decouple a few physical ways of the LLC to provide dedicated and exclusive storage for dirty blocks after they are evicted from the cache and before they are sent to the off-chip memory. By displacing dirty blocks within the exclusive storage, we keep them in the cache based on their rewrite distance and evict them when they are unlikely to be reused soon. To maximize the effectiveness of the exclusive storage, we manage it as a Cuckoo Cache to offer associativity based on the various applications' demands. Through detailed evaluations targeting various single- and multi-threaded applications, we show that our proposal reduces the number of writebacks, on average, by 21% over the state-of-the-art method and enhances both performance and energy efficiency.
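
A toy sketch of the cuckoo-managed exclusive storage, under the simplifying assumption of two hash functions over a small array reserved for evicted dirty blocks: an incoming block may displace a resident one into its alternative slot, and a block that still cannot be placed after a bounded number of kicks is finally written back. Sizes, hash functions, and names are illustrative, not the paper's.

    SLOTS = 8            # entries in the dedicated dirty-block storage
    MAX_KICKS = 4        # bound on the cuckoo displacement chain

    table = [None] * SLOTS
    h1 = lambda tag: tag % SLOTS
    h2 = lambda tag: (tag // SLOTS + 1) % SLOTS

    def insert_dirty(tag, writeback):
        """Try to keep an evicted dirty block on chip; write back only on failure."""
        for _ in range(MAX_KICKS):
            for h in (h1, h2):
                slot = h(tag)
                if table[slot] is None:
                    table[slot] = tag
                    return
            # Both candidate slots occupied: displace the victim from h1's slot.
            slot = h1(tag)
            tag, table[slot] = table[slot], tag
        writeback(tag)   # displacement chain too long; send block off chip

    evicted = []
    for t in (3, 11, 19, 27, 35):            # several tags sharing the same h1 slot
        insert_dirty(t, evicted.append)
    print(table, evicted)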

Optimization of Fault-Tolerant Mixed-Criticality Multi-Core Systems with Enhanced WCRT Analysis

This paper proposes a novel optimization technique for fault-tolerant mixed-criticality multi-core systems with worst-case response time (WCRT) guarantees. Typically, in fault-tolerant multi-core systems, tasks can be replicated or re-executed in order to enhance reliability. In addition, under the policy of mixed-criticality scheduling, low-criticality tasks can be dropped at runtime. Such uncertainties, caused by hardening and mixed-criticality scheduling, make WCRT analysis very difficult. We show that previous analysis techniques are pessimistic, as they consider avoidable extreme cases that can be safely ignored within the given reliability constraint. We improve the analysis to reduce the pessimism of WCRT estimates by considering the maximum number of faults to be tolerated. Further, we improve mixed-criticality scheduling by allowing partial dropping of low-criticality tasks. On top of these, we explore the design space of hardening, task-to-core mapping, and quality of service of multi-core mixed-criticality systems. The effectiveness of the proposed technique is verified by extensive experiments with synthetic and real-life benchmarks.

A Hardware-Efficient Block Matching Algorithm and Its Hardware Design for Variable Block Size Motion Estimation in Ultra-High-Definition Video Encoding

Variable block size motion estimation has contributed greatly to achieving optimal inter-frame encoding, but it involves high computational complexity and huge memory access, which is the most critical bottleneck in ultra-high-definition video encoding. This paper presents a hardware-efficient block matching algorithm with an efficient hardware design, which is able to reduce the computational complexity of motion estimation while providing sustained and steady coding performance for high-quality video encoding. A three-level memory organization is proposed to reduce the memory bandwidth requirement while supporting a predictive common search window. By applying multiple search strategies and early termination, the proposed design provides 1.8 to 3.7 times higher hardware efficiency than other works. Furthermore, on-chip memory has been reduced by 96.5% and the off-chip bandwidth requirement by 39.4% thanks to the proposed three-level memory organization. The corresponding power consumption is only 198 mW at the highest working frequency of 500 MHz. The proposed design is attractive for high-quality video encoding in real-time applications with low power consumption.
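
To make the block matching cost concrete, here is a small software sketch of sum-of-absolute-differences (SAD) matching over a search window with early termination: the accumulation aborts as soon as the partial SAD exceeds the best score found so far, one of the standard ways to cut computation. The frame data, block size, and search range are made up, and the proposed hardware design is of course not a Python loop.

    import random

    N, W = 8, 4                      # block size and search range (+/- W pixels)
    random.seed(0)
    ref = [[random.randrange(256) for _ in range(32)] for _ in range(32)]
    cur = [[random.randrange(256) for _ in range(32)] for _ in range(32)]

    def sad(bx, by, dx, dy, best):
        """Partial SAD of the block at (bx, by) vs. the reference shifted by (dx, dy);
        aborts early once the running sum exceeds the best SAD so far."""
        s = 0
        for y in range(N):
            for x in range(N):
                s += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
            if s >= best:            # early termination after each row
                return best
        return s

    def motion_vector(bx, by):
        best, mv = float("inf"), (0, 0)
        for dy in range(-W, W + 1):
            for dx in range(-W, W + 1):
                cost = sad(bx, by, dx, dy, best)
                if cost < best:
                    best, mv = cost, (dx, dy)
        return mv, best

    print(motion_vector(8, 8))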

3-D Floorplan Representations by Using Corner Links and Partial Order

We propose a new 3-D IC floorplan representation methodology using corner links and partial order. In this paper, (1) we introduce our novel 3-D IC floorplan representation, called corner links, (2) we analyze the equivalence relation between the corner links and the corresponding partial order representations, and (3) we discuss several key properties of the corner links and partial order representations. We demonstrate that the corner links representation can be reduced to its corresponding partial order representation. Also, the corner links representation of a non-degenerate 3-D mosaic floorplan can be equivalently expressed by the four tree representation. The partial order representation defines the topological structure of the 3-D mosaic floorplan with three transitive closure graphs, one for each direction, and captures all cutting planes in the floorplan in the order of their respective directions. If the partial order representation describes relations between all pairs of blocks in the 3-D floorplan, then the floorplan is a valid floorplan. We show that the partial order representation can restore the absolute coordinates of all blocks in the 3-D mosaic floorplan by using the given physical dimensions of the blocks.

Automatic Optimization of the VLAN Partitioning in Automotive Communication Networks

Dividing the communication network into so-called virtual local area networks (VLANs), i.e., subnetworks that are isolated at the data link layer (OSI layer 2), is a promising approach to address the increasing security challenges in automotive networks. The automation of VLAN partitioning is a well-researched problem in the area of local and metropolitan area networks. However, the approaches used there are hardly applicable to the design of automotive networks, as they mainly focus on reducing the amount of broadcast traffic and cannot capture the many design objectives of automotive networks, such as message timing or link load, which are affected by the VLAN partitioning. As a remedy, this article proposes a 0-1 ILP-based approach to generate a message routing that is feasible with respect to the VLAN-related routing restrictions in automotive networks. This approach can be used in a design space exploration to optimize not only the VLAN partitioning but also other routing-related objectives. We demonstrate both the efficiency of our message routing approach and the newly accessible optimization potential for the complete E/E architecture using a mixed-criticality system from the automotive domain.
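
As a hint of what a 0-1 ILP routing formulation can look like, the toy PuLP model below routes a single message over a four-node network using binary link-usage variables, flow-conservation constraints, and a constraint forbidding links whose endpoints lie in different VLANs. The topology, VLAN assignment, and objective are invented for illustration and are far simpler than the article's formulation.

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

    # Toy topology: directed links with a load cost; nodes a, b, d are in VLAN 1, c in VLAN 2.
    links = {("a", "b"): 1, ("b", "c"): 1, ("b", "d"): 2, ("c", "d"): 1}
    vlan = {"a": 1, "b": 1, "c": 2, "d": 1}
    src, dst = "a", "d"

    prob = LpProblem("vlan_routing", LpMinimize)
    x = {l: LpVariable(f"use_{l[0]}_{l[1]}", cat=LpBinary) for l in links}
    prob += lpSum(cost * x[l] for l, cost in links.items())      # minimize link load

    # Flow conservation: out - in = 1 at the source, -1 at the destination, 0 elsewhere.
    for n in vlan:
        out_flow = lpSum(x[l] for l in links if l[0] == n)
        in_flow = lpSum(x[l] for l in links if l[1] == n)
        prob += out_flow - in_flow == (1 if n == src else -1 if n == dst else 0)

    # VLAN restriction: the message must not use a link crossing VLAN boundaries.
    for (u, v) in links:
        if vlan[u] != vlan[v]:
            prob += x[(u, v)] == 0

    prob.solve()
    print([l for l in links if value(x[l]) > 0.5])   # expected route: a -> b -> d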

Instruction-Level Abstraction (ILA): A Uniform Specification for System-on-Chip (SoC) Verification

Modern Systems-on-Chip (SoC) designs are increasingly heterogeneous and contain specialized semi-programmable accelerators in addition to programmable processors. In contrast to the pre-accelerator era, when the ISA played an important role in verification by enabling a clean separation of concerns between software and hardware, verification of these accelerator-rich SoCs presents new challenges. From the perspective of hardware designers, there is a lack of a common framework for formal functional specification of accelerator behavior. From the perspective of software developers, there exists no unified framework for reasoning about software/hardware interactions of programs that interact with accelerators. This paper addresses these challenges by providing a formal specification and high-level abstraction for accelerator functional behavior. It formalizes the concept of an Instruction Level Abstraction (ILA), developed informally in our previous work on abstraction synthesis, and shows its application in modeling and verification of accelerators. This formal ILA extends the familiar notion of instructions to accelerators and provides a uniform, modular, and hierarchical abstraction for modeling software-visible behavior of both accelerators and programmable processors. We demonstrate the applicability of the ILA through several case studies of accelerators (for image processing, machine learning and cryptography), and a general-purpose processor (RISC-V). We show how the ILA model facilitates equivalence checking between two ILAs, and between an ILA and its hardware finite-state machine (FSM) implementation. Further, this equivalence checking supports accelerator upgrades using the notion of ILA compatibility, similar to processor upgrades using ISA compatibility.
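
Informally, an ILA lifts the notion of an instruction to arbitrary accelerators: each instruction has a decode predicate over the software-visible state and inputs, and an update function that produces the next software-visible state. The Python rendering below is only a concrete analogy with invented instruction names; the actual framework models these components symbolically for formal verification.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    State = Dict[str, int]

    @dataclass
    class Instruction:
        name: str
        decode: Callable[[State, State], bool]   # does this instruction fire on (state, inputs)?
        update: Callable[[State, State], State]  # next software-visible state

    # Two toy "instructions" of a hypothetical accumulate accelerator.
    isa: List[Instruction] = [
        Instruction("CLEAR",
                    lambda s, i: i["cmd"] == 0,
                    lambda s, i: {**s, "acc": 0}),
        Instruction("ACC",
                    lambda s, i: i["cmd"] == 1,
                    lambda s, i: {**s, "acc": s["acc"] + i["data"]}),
    ]

    def step(state: State, inputs: State) -> State:
        """One ILA step: exactly the instruction whose decode predicate holds fires."""
        matches = [ins for ins in isa if ins.decode(state, inputs)]
        assert len(matches) == 1, "decode predicates must be mutually exclusive"
        return matches[0].update(state, inputs)

    s = {"acc": 0}
    for cmd, data in [(1, 5), (1, 7), (0, 0)]:
        s = step(s, {"cmd": cmd, "data": data})
        print(s)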

Performance-Aware Test Scheduling for Diagnosing Coexistent Channel Faults in Topology-Agnostic Networks-on-Chip

High-performance multiprocessor SoCs (MPSoCs) in use today require a complex network-on-chip (NoC) as their communication architecture, and the channels therein often suffer from various manufacturing defects. Such physical defects cause a multitude of system-level failures and subsequent degradation of the reliability, yield, and performance of the computing platform. Most of the commonly practiced test approaches consider mesh-based NoC channels only and do not perform well for other topologies, such as octagon or spidergon networks, with regard to test time and overhead. This paper proposes a cost-effective and topology-agnostic test mechanism that is capable of diagnosing, online, coexistent channel-short and stuck-at faults in these special NoCs as well as in traditional mesh architectures. We introduce a new test model called Damaru to decompose the network and present an efficient scheduling scheme to reduce test time without compromising resource utilization during testing. Additionally, the proposed scheduling scheme scales well with network size, channel width, and topology. Simulation results show that the method achieves nearly 92% fault coverage, reduces area overhead by almost 60%, and reduces test time by 98% compared to earlier approaches. In addition, packet latency and energy consumption are also improved by 67.05% and 54.69%, respectively, and they improve further with increasing network size.

Probabilistic Evaluation of Hardware Security Vulnerabilities

Various design techniques can be applied to implement finite state machine (FSM) functions in order to optimize timing, performance, and power, and to reduce overhead. Recently, malicious attacks on hardware systems have emerged as a critical problem. Fault injection attacks, in particular, alter the function of a hardware system or reveal its critical information through precisely controlled fault injection processes. Attackers can exploit the loopholes and vulnerabilities of FSM functions to access states that are under protection. A probabilistic model is developed in this paper to evaluate the potential vulnerabilities of FSM circuits at the design stage. Analysis based on the statistical behavior of the FSM also shows that induced circuit errors can be exploited to access the protected states. An effective solution based on state re-encoding is proposed to minimize the risk of unauthorized transitions. Simulation results demonstrate that vulnerable transition paths can be protected with small hardware overhead.
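
One way to read the probabilistic model is as the chance that a fault-induced bit flip in the state register lands in a protected state. The sketch below, for a hypothetical 4-bit encoding and a uniform single-bit fault model, computes that probability for two candidate codes of the protected state, showing how re-encoding can remove the single-bit paths. Both encodings and the fault model are illustrative only, not the paper's model.

    WIDTH = 4
    normal_states = ["IDLE", "LOAD", "RUN", "DONE"]

    def flip_probability(encoding, protected_code):
        """Probability that a random single-bit flip of a random normal state's
        code lands exactly on the protected state's code."""
        hits = sum((encoding[s] ^ (1 << b)) == protected_code
                   for s in normal_states for b in range(WIDTH))
        return hits / (len(normal_states) * WIDTH)

    enc = {"IDLE": 0b0000, "LOAD": 0b0001, "RUN": 0b0010, "DONE": 0b0011}

    # Encoding A: the protected code is Hamming-distance 1 from IDLE's code.
    print(flip_probability(enc, protected_code=0b0100))   # 1/16 = 0.0625

    # Re-encoded: protected code far from every normal code, so no single-bit path.
    print(flip_probability(enc, protected_code=0b1100))   # 0.0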

SystemC-AMS Thermal Modeling for the Co-simulation of Functional and Extra-Functional Properties

Temperature is a critical property of smart systems, due to its impact on reliability and its inter-dependence with power consumption. Unfortunately, current design flows evaluate thermal evolution ex post, on offline power traces. This does not allow temperature to be considered as a dimension in the design loop, and it misses all the complex inter-dependencies with design choices and power evolution. In this paper, by adopting the functional language SystemC-AMS, we propose a method to enable thermal/power/functional co-simulation. The system thermal model is built using state-of-the-art equivalent-circuit models, exploiting the support for electrical linear networks intrinsic to SystemC-AMS. The experimental results show that the choice of SystemC-AMS is a winning strategy for building a simultaneous simulation of multiple functional and extra-functional properties of a system. The generated code exhibits accuracy comparable to that of the reference thermal simulator HotSpot. Additionally, the initial overhead due to the general-purpose nature of SystemC-AMS is compensated by the surprisingly high performance of transient simulation, with speedups as high as two orders of magnitude. The application of the proposed methodology to a set of benchmarks used for the IEEE PATMOS design contest further demonstrates the effectiveness of the SystemC-AMS thermal simulator.
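
The equivalent-circuit idea maps thermal capacitance and conductance onto an electrical RC network driven by power as a current source, so temperature can be advanced step by step alongside the functional simulation. A plain forward-Euler version of a single-node model is sketched below; SystemC-AMS would instead express the network declaratively as an electrical linear network, and the constants here are arbitrary.

    # One-node RC thermal model: C * dT/dt = P(t) - (T - T_amb) / R
    R, C = 2.0, 0.5          # thermal resistance [K/W] and capacitance [J/K]
    T_amb, dt = 25.0, 0.01   # ambient temperature [C], time step [s]

    def thermal_step(T, power):
        """Advance the die temperature by one simulation step (forward Euler)."""
        return T + dt * (power - (T - T_amb) / R) / C

    T = T_amb
    trace = []
    for k in range(1000):                             # co-simulation loop: power in, temperature out
        power = 3.0 if (k // 200) % 2 == 0 else 0.5   # power reported by the functional model
        T = thermal_step(T, power)
        trace.append(T)
    print(f"peak temperature: {max(trace):.1f} C")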

Efficiently Managing the Impact of Hardware Variability on GPUs' Streaming Processors

GPUs are widely used in general-purpose high-performance computing due to their highly parallel architecture. In recent years, nanometer-scale integrated circuit manufacturing processes have made GPUs' computation capability even stronger. However, as process technology scales down, hardware variability, e.g., process variations (PVs) and negative bias temperature instability (NBTI), has a higher impact on chip quality. The parallelism of GPUs demands high consistency among the hardware units on a chip; otherwise, the worst unit inevitably becomes the bottleneck. Hardware variability therefore becomes a pressing concern for further improving GPUs' performance and lifetime, not only in integrated circuit fabrication, but even more in GPU architecture design. Streaming Processors (SPs) are the key units in GPUs and perform most of the parallel computing operations. Therefore, in this work, we focus on mitigating the impact of hardware variability on GPU SPs. We first model and analyze SPs' performance variations under hardware variability. We observe that both PV and NBTI have a large impact on SP performance. We further observe unbalanced SP utilization during program execution, e.g., some SPs are idle while others are active. Leveraging this observation, we propose a Hardware Variability-aware SP Management policy (HVSM), which dynamically dispatches computation to appropriate SPs to balance their utilization. In addition, we find that a large portion of compute operations are duplicates. We therefore also propose an Operation Compression (OC) technique that eliminates these unnecessary computations to further mitigate the effects of hardware variability. Our experimental results show that the combined HVSM and OC technique effectively reduces the impact of hardware variability, translating to a 37% performance improvement or an 18.3% lifetime extension for a GPU chip.

Writeback-Aware LLC Management for PCM-based Main Memory Systems

With the increasing number of data-intensive applications in today's workloads, DRAM-based main memories are struggling to satisfy the growing capacity demand. Phase Change Memory (PCM) is a non-volatile memory technology that has been explored as a promising alternative to DRAM-based main memories due to its better scalability and lower leakage energy. Despite its many advantages, PCM also has shortcomings, such as long write latency, high write energy consumption, and limited write endurance, all of which are related to write operations. In this paper, we propose a novel writeback-aware Last Level Cache (LLC) management scheme named WALL to reduce the number of LLC writebacks and consequently improve the performance, energy efficiency, and lifetime of a PCM-based main memory system. First, we investigate the writeback behavior of LLC sets and show that writebacks are not uniformly distributed among sets; some sets observe much higher writeback rates than others. We then propose a writeback-aware set-balancing mechanism, which employs underutilized LLC sets with few writebacks as auxiliary storage for dirty lines evicted from sets with frequent writebacks. We also propose a simple and effective writeback-aware replacement policy to avoid evicting dirty blocks that are highly reused after being evicted from the cache. Our experimental results show that WALL achieves an average reduction of 30.9% in the total number of LLC writebacks compared to the baseline scheme. As a result, WALL can significantly reduce memory energy consumption and enhance PCM lifetime.
