Due to the increasing diversity and complexity of applications in embedded systems, accelerator designs that trade off area/energy efficiency against design productivity are becoming a crucial issue. Targeting applications in the category of Recognition, Mining, and Synthesis (RMS), this study proposes a novel accelerator design that achieves a good trade-off between efficiency and design productivity (or reusability) by introducing the computing paradigm of "approximate computing (AC)." Leveraging the facts that frequently executed parts of applications (i.e., hotspots) are conventionally the target of acceleration, and that RMS applications are error-tolerant and often process similar input data repeatedly, our proposed accelerator reuses previous computational results for sufficiently similar data to reduce computation. The proposed accelerator is composed of a simple controller and a dedicated memory that stores a limited set of previous input data along with the corresponding computational results for a hotspot. The accelerator can therefore be applied to different and/or multiple hotspots/applications with only a small extension of the controller, achieving an efficient accelerator design and addressing the design-productivity issue. We conducted quantitative evaluations using a representative RMS application (image compression) to demonstrate the effectiveness of our method over conventional ones based on precise computing. Moreover, we provide important findings on parameter exploration for our accelerator design, offering wider applicability of our accelerator to other applications.
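The reuse idea above can be sketched in software. The following is a minimal illustration, not the paper's hardware design: a bounded table of previous (input, result) pairs, where a new input within a distance tolerance of a stored input reuses its result. The class name, distance metric, and FIFO eviction policy are our assumptions for the sketch.

```python
import math

class ApproxMemo:
    """Sketch of computation reuse for error-tolerant hotspots.

    Stores a bounded set of (input, result) pairs; a new input within
    `tol` (Euclidean distance) of a stored input reuses its stored
    result instead of recomputing. Table size and eviction policy
    (FIFO) are illustrative assumptions, not the paper's design.
    """
    def __init__(self, fn, tol=0.1, capacity=4):
        self.fn = fn
        self.tol = tol
        self.capacity = capacity
        self.table = []  # list of (input_vector, result) pairs

    def __call__(self, x):
        for stored, result in self.table:
            if math.dist(stored, x) <= self.tol:
                return result          # hit: reuse the previous result
        result = self.fn(x)            # miss: compute precisely
        if len(self.table) >= self.capacity:
            self.table.pop(0)          # evict the oldest entry (FIFO)
        self.table.append((tuple(x), result))
        return result

expensive = lambda v: sum(t * t for t in v)  # stands in for a hotspot
memo = ApproxMemo(expensive, tol=0.05)
print(memo((1.0, 2.0)))   # computed precisely: 5.0
print(memo((1.01, 2.0)))  # close enough: reuses 5.0
```

Because the controller only performs a compare-and-lookup, swapping in a different hotspot amounts to changing `fn` and the tolerance, which mirrors the reusability argument in the abstract.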
The globalization of the semiconductor supply chain introduces ever-increasing security and privacy risks. Two major concerns are IP theft through reverse engineering and malicious modification of the design; the latter concern in part relies on successful reverse engineering of the design as well. IC camouflaging and logic locking are two techniques that can thwart reverse engineering by end-users or foundries. However, developing low-overhead locking/camouflaging schemes that can resist the ever-evolving state-of-the-art attacks has been a research challenge for several years. This article provides a comprehensive review of the state of the art in locking/camouflaging techniques. We start by defining a systematic threat model for these techniques and discuss how various real-world scenarios relate to each threat model. We then discuss the evolution of generic algorithmic attacks under each threat model, leading to the strongest existing attacks. The paper then systematizes defences, discussing attacks that are specific to certain kinds of locking/camouflaging. In conclusion, the paper discusses open problems and future directions.
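To make the locking and attack concepts concrete, here is a toy sketch of our own (not from the article): XOR key gates inserted into a three-input circuit, plus a brute-force oracle-guided attack that prunes key candidates using input/output queries to a working (unlocked) chip. Real algorithmic attacks, such as SAT-based attacks, perform this pruning far more cleverly, but the principle is the same.

```python
from itertools import product

def original(a, b, c):
    """The designer's intended function: (a AND b) XOR c."""
    return (a & b) ^ c

def locked(a, b, c, k0, k1):
    """Locked netlist with two XOR key gates: one on an input of the
    AND gate, one on the primary output. Only key (0, 0) restores the
    original function for all inputs."""
    return ((a ^ k0) & b) ^ c ^ k1

def oracle_attack(locked_fn, oracle, n_key_bits):
    """Keep only keys consistent with every oracle I/O observation."""
    candidates = list(product([0, 1], repeat=n_key_bits))
    for a, b, c in product([0, 1], repeat=3):  # query the working chip
        out = oracle(a, b, c)
        candidates = [k for k in candidates
                      if locked_fn(a, b, c, *k) == out]
    return candidates

print(oracle_attack(locked, original, 2))  # -> [(0, 0)]
```

With only two key bits the attack trivially recovers the key; the survey's point is that defences must keep the pruning hard even for attacks that are much smarter than exhaustive search.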
This paper presents a comprehensive survey of time-multiplexed (TM) FPGA overlays from the research literature. These overlays are categorized based on their implementation into two groups: processor-based overlays, whose implementation follows that of conventional silicon-based microprocessors, and CGRA-like overlays, with either an array of interconnected processor-based functional units or medium-grained arithmetic functional units. Time-multiplexing the overlay allows it to change its behavior with cycle-by-cycle execution of the application kernel, thus allowing better sharing of the limited FPGA hardware resources. However, most TM overlays suffer from large resource overheads, due either to the underlying processor-like architecture (for processor-based overlays) or to the routing array and instruction storage requirements (for CGRA-like overlays). Reducing the area overhead of CGRA-like overlays, specifically that required for the routing network, and better utilizing the hard macros in the target FPGA are active areas of research.
Heterogeneous multichip architectures have gained significant interest in high-performance computing clusters to cater to a wide range of applications. In particular, heterogeneous systems with multiple multicore CPUs, GPUs and memory have become commonplace to meet application requirements. Shared resources, such as the interconnection network, pose significant challenges in such systems due to the diverse traffic requirements of CPUs and GPUs. In particular, the performance and energy consumption of inter-chip communication have remained a major bottleneck due to the limitations imposed by off-chip wired links. To overcome these challenges, we propose a wireless interconnection network to provide energy-efficient, high-performance communication in heterogeneous multi-chip systems. Interference-free communication between GPUs and memory modules is achieved through directional wireless links, while omni-directional wireless interfaces connect cores in the CPUs with other components in the system. Besides providing low-energy, high-bandwidth inter-chip communication, the wireless interconnection scales efficiently with system size to provide high performance across multiple chips. The proposed inter-chip wireless interconnection is evaluated on two system sizes with multiple CPU and multiple GPU chips, along with main memory modules. On a system with 4 CPU and 4 GPU chips, application runtime is sped up by 3.94×, packet energy is reduced by 94.4% and packet latency is reduced by 58.34% compared to a baseline system with a wired inter-chip interconnection.
The trade-off between analyzability and expressiveness is a key factor when choosing a suitable dataflow model of computation (MoC) for designing, modeling and simulating applications on a formal basis. A large number of techniques and analysis tools exist for static dataflow models, such as synchronous dataflow. However, they cannot express the dynamic behavior required by more dynamic signal-streaming applications or for modeling runtime-reconfigurable systems. On the other hand, dynamic dataflow models like Kahn process networks sacrifice analyzability for expressiveness. Scenario-aware dataflow (SADF) offers an excellent trade-off, providing sufficient expressiveness for dynamic systems while still giving access to powerful analysis methods. In spite of increasing interest in SADF methods, there is a lack of formally defined functional models for describing and simulating SADF systems. This paper addresses this gap by introducing a functional model for the SADF MoC, as well as a set of abstract operations for simulating it. We present the first modeling and simulation environment for SADF and demonstrate its capabilities through a comprehensive tutorial-style example of a RISC processor described as an SADF application, as well as a model of an MPEG-4 simple profile decoder. Finally, we discuss the potential of our formal model as a frontend for formal system design flows for dynamic applications.
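The scenario idea at the heart of SADF can be illustrated with a toy simulation of our own (this is not the paper's functional model): a scenario sequence, which in full SADF would come from a detector process, switches the token rates of a kernel on a per-firing basis. Scenario names and rates below are illustrative assumptions.

```python
from collections import deque

# Each scenario fixes the kernel's consumption/production token rates,
# the defining feature of scenario-aware dataflow. Rates are made up.
scenarios = {
    "low":  {"consume": 1, "produce": 1},
    "high": {"consume": 2, "produce": 3},
}

def simulate(scenario_seq, inputs):
    """Fire the kernel once per scenario in the sequence, switching its
    rates per firing; stall if the input channel lacks enough tokens."""
    fifo = deque(inputs)   # input channel
    out = []               # output channel
    for s in scenario_seq:
        rates = scenarios[s]
        if len(fifo) < rates["consume"]:
            break                       # kernel not enabled: stall
        consumed = [fifo.popleft() for _ in range(rates["consume"])]
        avg = sum(consumed) / len(consumed)
        out.extend([avg] * rates["produce"])
    return out

print(simulate(["low", "high", "low"], [1, 2, 4, 8]))
# -> [1.0, 3.0, 3.0, 3.0, 8.0]
```

Within each scenario the rates are fixed, so static-dataflow analyses remain applicable per scenario, which is exactly the analyzability/expressiveness compromise the abstract describes.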
Approximate computing is a promising design paradigm. By allowing inexact computation in error-tolerant applications, approximate computing can gain both performance and energy efficiency. A neural network (NN) is, in theory, a universal approximator, so emerging DNN accelerators deployed with an NN-based approximator are promising candidates for approximate computing. Nevertheless, the approximation result must satisfy the user's quality requirement; an NN-based classifier is normally deployed to ensure approximation quality, so that only inputs that meet the quality requirement are executed by the approximator. The potential of these two NNs, however, has not been fully explored; involving two NNs in approximate computing raises critical optimization questions, such as how to reconcile the two NNs' distinct views of the input data space, how to train the two correlated NNs, and what their topologies should be. In this paper, we propose a novel NN-based approximate computing framework with quality assurance. We advocate a co-training approach that trains the classifier and the approximator alternately. In each iteration, we coordinate the training of the two NNs with a judicious selection of training data. Next, we explore different selection policies and propose selecting training data from multiple iterations. We also optimize the classifier by integrating a dynamic threshold-tuning algorithm, and we propose two efficient algorithms to explore the smallest topologies of the NN-based approximator and the classifier. Experimental results show significant improvements in quality and energy efficiency compared to existing NN-based approximate computing frameworks.
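The alternating structure of co-training can be sketched with toy models. In the sketch below (our illustration, with 1-D least-squares fits standing in for the two neural networks), each iteration trains the approximator on the currently accepted inputs, then relabels every input by whether its approximation error meets the quality bound; those labels are what a classifier NN would then be trained to predict. All names and numbers are assumptions for illustration.

```python
def fit_linear(xs, ys):
    """Least-squares line fit; a stand-in for training the approximator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:                       # degenerate case: constant model
        return lambda x: my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return lambda x: a * (x - mx) + my

def co_train(xs, ys, err_bound=1.0, iters=3):
    """Alternate: (1) train the approximator on accepted inputs,
    (2) relabel inputs by whether the approximation meets the error
    bound -- the accept/reject labels a classifier would learn."""
    accepted = list(range(len(xs)))    # initially accept everything
    for _ in range(iters):
        approx = fit_linear([xs[i] for i in accepted],
                            [ys[i] for i in accepted])
        accepted = [i for i in range(len(xs))
                    if abs(approx(xs[i]) - ys[i]) <= err_bound]
    return approx, accepted

# One "hard" input (x=4, y=8) breaks the linear trend; co-training
# rejects it, and the approximator then fits the remaining inputs well.
xs, ys = [0, 1, 2, 3, 4], [0, 1, 2, 3, 8]
approx, accepted = co_train(xs, ys)
print(accepted)  # -> [0, 1, 2, 3]
```

The point of the iteration is visible even in this toy: once the hard input stops polluting the approximator's training set, previously borderline inputs fall back inside the quality bound, so the accepted set grows rather than shrinks.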
Battery-operated, low-power portable computing devices are becoming an inseparable part of human daily life. One of the major goals is to achieve the longest battery life in such a device. Additionally, the need for performance in processing multimedia content is ever increasing, and processing image and video content consumes more power than other applications. A common approach to improving energy efficiency is to implement the computationally intensive functions as digital hardware accelerators. Spatial filtering is one of the most commonly used methods in digital image processing. According to Fourier theory, an image can be considered a two-dimensional signal composed of spatially extended two-dimensional sinusoidal patterns called gratings. Spatial frequency theory states that sinusoidal gratings can be characterised by their spatial frequency, phase, amplitude and orientation. This paper presents results from our investigation into the impact of these characteristics of a digital image on the energy efficiency of hardware-accelerated spatial filters employed to process that image. Two greyscale images, each of size 128x128 pixels and comprising two-dimensional sinusoidal gratings at the maximum spatial frequency of 64 cycles per image, orientated at 0 and 90 degrees respectively, were processed by a hardware-implemented Gaussian smoothing filter. The energy efficiency of the filter was compared with the baseline energy efficiency of processing a featureless plain black image. The results show that the energy efficiency of the filter drops to 12.5% when the gratings are orientated at 0 degrees, whilst rising to 72.38% at 90 degrees.
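Test stimuli of this kind can be reproduced in software. The sketch below is our illustration, not the authors' image generator or hardware filter: it builds a sinusoidal grating from the four characteristics named above (frequency, phase, amplitude, orientation) and applies a 3x3 binomial Gaussian smoothing, with the 0-degree convention here taken as modulation along x. At 64 cycles per 128-pixel image the grating is the Nyquist-rate alternating pattern, which Gaussian smoothing suppresses entirely.

```python
import math

def grating(n, freq, orientation_deg, amplitude=1.0, phase=0.0):
    """n x n greyscale sinusoidal grating characterised by spatial
    frequency (cycles per image), phase, amplitude and orientation."""
    theta = math.radians(orientation_deg)
    img = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # project (x, y) onto the grating's modulation axis
            u = x * math.cos(theta) + y * math.sin(theta)
            img[y][x] = amplitude * math.sin(
                2 * math.pi * freq * u / n + phase)
    return img

def gaussian3x3(img):
    """3x3 Gaussian smoothing (binomial kernel, normalized by 16),
    the kind of convolution a hardware datapath would compute."""
    k = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    n = len(img)
    out = [[0.0] * n for _ in range(n)]
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            acc = sum(k[j][i] * img[y - 1 + j][x - 1 + i]
                      for j in range(3) for i in range(3))
            out[y][x] = acc / 16
    return out

# Max-frequency grating modulated along x; phase pi/2 makes the sampled
# pixels alternate +1/-1 (phase 0 would sample the sine at its zeros).
g = grating(128, 64, 0, phase=math.pi / 2)
smooth = gaussian3x3(g)
```

Note that the smoothed interior is (numerically) zero: the 1-2-1 weights cancel the alternating columns exactly, so the maximum-frequency component is removed by the filter, which is consistent with such gratings being the most demanding stimulus for it.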