The threads of a GPU are grouped into SIMD batches that execute the same instruction on vectors of data in lock-step. GPU register files are huge because each SIMD group accesses a dedicated set of vector registers for fast context switching, and consequently the power consumption of register files has become an important issue. One proposed solution is to replace some of the vector registers with scalar registers. However, it has also been observed that a significant number of registers contain affine vectors v, where v[i] = b + i × s, which can be represented compactly by the base b and the stride s. This paper therefore proposes an affine register file design for GPUs that is energy efficient because it eliminates redundant computations on both uniform and affine vectors. The design uses a pair of registers to store the base and stride of each affine vector and provides dedicated affine ALUs to execute affine instructions. A compiler analysis has been developed to detect scalars and affine vectors and to annotate instructions so that the corresponding scalar and affine computations can be performed. Furthermore, a priority-based register allocation scheme has been implemented to assign scalars and affine vectors to the appropriate scalar and affine register files. Experimental results show that this design reduced the energy consumption of the register files and the ALUs to 21.86% and 26.54% of the baseline, respectively, and reduced the overall energy consumption of the GPU by an average of 5.18%.
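To make the affine representation concrete, here is a minimal Python sketch (an editorial illustration, not the paper's hardware design) of how a vector v with v[i] = b + i × s collapses into the pair (b, s), and why an affine ALU needs only a couple of scalar operations per vector instruction; lane count and names are assumptions.

```python
# A minimal sketch of the affine representation: store (base, stride)
# instead of `width` lanes, and keep arithmetic in that compact form.

class Affine:
    """An affine vector over `width` lanes, stored as base + stride."""
    def __init__(self, base, stride, width=32):
        self.base, self.stride, self.width = base, stride, width

    def expand(self):
        # Materialize the full vector: lane i holds base + i * stride.
        return [self.base + i * self.stride for i in range(self.width)]

    def __add__(self, other):
        # (b1 + i*s1) + (b2 + i*s2) = (b1+b2) + i*(s1+s2):
        # two scalar adds instead of `width` vector-lane adds.
        return Affine(self.base + other.base,
                      self.stride + other.stride, self.width)

    def scale(self, k):
        # k * (b + i*s) = k*b + i*(k*s), again two scalar multiplies.
        return Affine(k * self.base, k * self.stride, self.width)

# Example: thread IDs (base 0, stride 1) plus a uniform offset.
# A "uniform" vector is simply the stride-0 special case.
tid = Affine(base=0, stride=1)
offset = Affine(base=256, stride=0)
assert (tid + offset).expand() == [256 + i for i in range(32)]
```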
As designs continue to grow in size and complexity, the EDA paradigm shifts from flat to hierarchical timing analysis. In this paper, we present compact and accurate timing macro modeling, which is the key to efficient and accurate hierarchical timing analysis. Our goal is to retain only a minimal amount of interface logic in the timing macro model. The main idea is to separate the interface logic into variant and constant timing regions: the variant timing region is preserved for accuracy, while the constant timing region is reduced for compactness. To reduce the constant timing region, we propose anchor pin insertion and deletion, generalizing existing timing graph reduction techniques. Furthermore, we devise a lookup table index selection technique to achieve high model accuracy over the range of possible operating conditions. Compared with two models commonly used in industry, the extracted timing model and the interface logic model, our model offers both high accuracy and small size. Based on the TAU 2016 and 2017 timing macro modeling contest benchmark suites, our results show that our algorithm delivers superior efficiency and accuracy: hierarchical timing analysis using our model significantly reduces runtime and memory compared with flat timing analysis on the original design. Moreover, our algorithm outperforms the TAU 2016 and 2017 contest winners in model accuracy, model size, model generation performance, and model usage performance.
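The following Python sketch illustrates the family of timing graph reductions such macro models build on (an editorial illustration with made-up arcs, not the paper's anchor-pin algorithm): an internal pin with a single fan-in and a single fan-out can be removed by merging its two arcs into one arc carrying the summed delay, while interface pins are preserved.

```python
# A minimal timing-graph reduction: collapse serial internal pins.

def reduce_serial_pins(arcs, interface_pins):
    """arcs: dict (u, v) -> delay; interface_pins are never removed."""
    changed = True
    while changed:
        changed = False
        fan_in, fan_out = {}, {}
        for (u, v) in arcs:
            fan_out.setdefault(u, []).append(v)
            fan_in.setdefault(v, []).append(u)
        for pin in list(fan_in):
            if pin in interface_pins:
                continue
            if len(fan_in.get(pin, [])) == 1 and len(fan_out.get(pin, [])) == 1:
                u, v = fan_in[pin][0], fan_out[pin][0]
                d = arcs.pop((u, pin)) + arcs.pop((pin, v))
                # Keep the worst (max) delay if a parallel arc already exists.
                arcs[(u, v)] = max(arcs.get((u, v), float("-inf")), d)
                changed = True
                break
    return arcs

# in -> a -> b -> out collapses to a single in -> out arc of delay 5.
arcs = {("in", "a"): 1.0, ("a", "b"): 2.0, ("b", "out"): 2.0}
print(reduce_serial_pins(arcs, interface_pins={"in", "out"}))
```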
Due to its complexity, the problem of mapping and scheduling streaming applications on heterogeneous MPSoCs under real-time and performance constraints has traditionally been tackled by incomplete heuristic algorithms. In recent years, approaches based on Constraint Programming (CP) have shown promising results as complete methods for finding optimal mappings, in particular with respect to throughput. However, so far none of the available CP approaches considers the trade-off between throughput and buffer requirements or between throughput and energy consumption. This paper integrates trade-off awareness into the CP model and introduces an n-step solving approach that exploits the advantages of heuristics while retaining the completeness property of CP. With a number of experiments considering several streaming applications and different platform models, the paper illustrates not only the efficiency of the presented model but also its suitability for solving different problems with various combinations of performance constraints.
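As a flavor of what a trade-off-aware CP mapping model looks like, here is a small sketch using Google OR-Tools CP-SAT (all task, PE, and cost data are assumptions for illustration, not the paper's model): energy is minimized subject to a per-PE load bound that stands in for a throughput constraint.

```python
# A toy CP formulation of the throughput/energy trade-off with OR-Tools.
from ortools.sat.python import cp_model

tasks, pes, period = ["t0", "t1", "t2"], ["pe0", "pe1"], 7
time   = {("t0", "pe0"): 4, ("t0", "pe1"): 2,   # execution time per PE
          ("t1", "pe0"): 3, ("t1", "pe1"): 6,
          ("t2", "pe0"): 5, ("t2", "pe1"): 3}
energy = {("t0", "pe0"): 2, ("t0", "pe1"): 5,   # the fast PE costs more energy
          ("t1", "pe0"): 1, ("t1", "pe1"): 7,
          ("t2", "pe0"): 2, ("t2", "pe1"): 4}

m = cp_model.CpModel()
x = {(t, p): m.NewBoolVar(f"{t}_{p}") for t in tasks for p in pes}
for t in tasks:                                   # each task mapped exactly once
    m.Add(sum(x[t, p] for p in pes) == 1)
for p in pes:                                     # throughput proxy: load <= period
    m.Add(sum(time[t, p] * x[t, p] for t in tasks) <= period)
m.Minimize(sum(energy[t, p] * x[t, p] for t in tasks for p in pes))

solver = cp_model.CpSolver()
if solver.Solve(m) == cp_model.OPTIMAL:
    print({t: p for t in tasks for p in pes if solver.Value(x[t, p])})
```

Tightening or relaxing `period` sweeps out the throughput/energy trade-off curve, which is the kind of exploration the n-step approach accelerates.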
Modern microprocessors contain a variety of mechanisms used to mitigate errors in logic and memory. Collectively, these methods are referred to as Reliability, Availability and Serviceability (RAS) techniques. Many of these techniques, such as component disabling, come at a performance cost. With the feature size of the processing fabric shrinking and ever more functionality integrated per chip, it is reasonable to expect that chip-wide error rates will intensify in the future and perhaps vary throughout the system lifetime. As a result, it is important to reclaim the temporal RAS overheads in a systematic way so that dependable performance can be enabled. The current paper presents a closed-loop control scheme that actuates the frequency of the processor based on detected timing interference in order to enable performance dependability. The concepts of slack and deadline vulnerability factor are introduced to enable dependable performance and to support the formulation of a discrete-time control problem. Default application timing is derived using the system scenario methodology, the applicability of which is demonstrated through simulations. Additionally, the proposed concept is demonstrated on a real platform and application: a Proportional-Integral-Derivative (PID) controller, implemented within the application, actuates the Dynamic Voltage and Frequency Scaling (DVFS) framework of the Linux kernel to effectively reclaim temporal overheads injected at run time. The current paper discusses the responsiveness and energy efficiency of the proposed performance dependability scheme. For a wide variety of timing noise interference patterns, the proposed scheme succeeds in guaranteeing dependable performance on a real platform and a relevant application. Finally, an additional formulation is introduced to predict the upper bound of timing interference that can be absorbed by actuating the DVFS of any processor; it is also validated on a representative reduction to practice.
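The control loop described above can be sketched in a few lines of Python (gains, timing model, and interference pattern are all hypothetical, chosen only to illustrate the mechanism, not taken from the paper): a PID controller observes each job's slack against its deadline and drives the next DVFS frequency.

```python
# A toy slack-driven PID/DVFS loop: negative slack pushes frequency up.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err, dt=1.0):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

pid = PID(kp=0.6, ki=0.2, kd=0.05)
freq, f_min, f_max = 1.0, 0.4, 2.0        # GHz, clamped to the DVFS range
work, deadline = 1.0, 1.2                 # normalized cycles, period in seconds

for job in range(8):
    noise = 0.3 if job in (3, 4) else 0.0  # injected timing interference
    runtime = work / freq + noise          # simple timing model
    slack = deadline - runtime             # positive slack = headroom
    # The error is the negated slack: a looming miss raises the frequency,
    # excess slack lets the controller relax it and save energy.
    freq = min(f_max, max(f_min, freq + pid.step(-slack)))
    print(f"job {job}: runtime={runtime:.2f}s slack={slack:+.2f}s -> f={freq:.2f}GHz")
```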
The Better-Than-Worst-Case (BTW) design methodology can achieve higher circuit energy efficiency, performance, or reliability by allowing timing errors in rare cases and rectifying them with error correction mechanisms. The performance of a BTW design therefore depends heavily on the correctness of common cases, i.e., frequent input patterns in a workload. However, most existing methods do not provide sufficiently scalable solutions and also overlook the whole picture of the design. We therefore propose a new technique, C-Mine, which combines two scalable techniques, data mining and SAT solving, to overcome these limitations. Data mining can efficiently extract patterns from an enormous data set, and SAT solving is well known for its scalable verification. In this work, we present two versions of C-Mine, C-Mine-DCT and C-Mine-APR, which aim at faster runtime and better energy saving, respectively. The experimental results show that, compared to a recent publication, C-Mine-DCT achieves comparable performance with an additional 8% energy saving and a 54x speedup for larger benchmarks on average. Furthermore, C-Mine-APR can achieve up to 13% more energy saving than C-Mine-DCT on designs with more common cases.
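The "common case" notion can be made concrete with a trivial frequency-mining sketch in Python (synthetic trace and a simple bit-prefix pattern; C-Mine's actual mining and SAT-based verification are far more sophisticated): patterns frequent enough in the workload are candidates for speculation, while rare ones are left to the error correction mechanism.

```python
# A toy "common case" miner over a synthetic 8-bit operand trace.
from collections import Counter
import random

random.seed(0)
# Most operands cluster in a narrow range, so their high-order bits
# form a frequent ("common case") pattern; the rest are random.
trace = [random.choice([0x3A, 0x3B, 0x3C, random.randrange(256)])
         for _ in range(10_000)]

support = 0.10                               # minimum relative frequency
hi_bits = Counter(v >> 4 for v in trace)     # pattern = top 4 bits
common = {p: n / len(trace)
          for p, n in hi_bits.items() if n / len(trace) >= support}
print({f"{p:04b}xxxx": round(f, 3) for p, f in common.items()})
```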
This paper presents a novel approach to overcoming the challenges of third-party IP integration, based on software-defined Systems-on-Chip (SoCs) and graph grammars. The IP supplier prepares a HW-accelerated software library (HASL) for the SoC architect. As a key point of our approach, HW integration knowledge is encoded in the library as a set of rules, defined in the machine-readable, standardized IP-XACT (IEEE 1685-2009) format. The library preparation step also includes the generation of configurable HW drivers, schedulers, and the software library functions. For the SoC architect, we have developed the graph-grammar-based IP integration (GRIP) tool. The software application is developed using the functions supplied in the HASL, and according to the utilized functions, the GRIP tool automatically integrates IP blocks using the rule information supplied with the library. The SoC architecture and rules are transformed into the graph domain so that graph rewriting methods can be applied: iterative rule application spans a design-space search tree with multiple candidate SoC architectures for a targeted software-defined SoC. The GRIP tool is model-driven and based on the Eclipse Modeling Framework (EMF). Applying code generation techniques, candidate SoC architectures can be transformed into hardware descriptions for the target platform, and the HW/SW interfaces between SW library functions and IP blocks can be generated automatically for bare-metal or Linux-based applications. The approach is demonstrated with a case study on the Xilinx Zynq-based ZedBoard evaluation board using a HASL for computer vision, which yielded a 150x performance improvement. The overhead of the generated drivers is only 0.28% compared to manually written drivers.
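The rule-driven rewriting idea can be illustrated with a deliberately simplified Python sketch (a hypothetical rule format on a dict-based graph; the real GRIP tool works on EMF models and IP-XACT rules): each rule matches a library function used by the application and splices the implementing HW blocks into the architecture graph.

```python
# A toy graph-rewriting step for rule-driven IP integration.

soc = {"nodes": {"cpu", "bus"}, "edges": {("cpu", "bus")}}
used_functions = ["sobel_filter"]            # functions the application calls

# Hypothetical rule: 'sobel_filter' needs a DMA and a Sobel IP core,
# both attached to the system bus.
rules = {"sobel_filter": {"add_nodes": {"dma", "sobel_ip"},
                          "add_edges": {("bus", "dma"), ("dma", "sobel_ip")}}}

def apply_rule(graph, rule):
    # One rewriting step: return a new candidate architecture graph.
    return {"nodes": graph["nodes"] | rule["add_nodes"],
            "edges": graph["edges"] | rule["add_edges"]}

for fn in used_functions:
    soc = apply_rule(soc, rules[fn])
print(sorted(soc["nodes"]), sorted(soc["edges"]))
```

Applying alternative rules for the same function is what spans the design-space search tree of candidate architectures.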
Field-programmable gate arrays (FPGAs) based on SRAM cells are an attractive alternative for real-time system designers, as they offer high density, low cost, and high performance. The use of SRAM cells in the FPGA's configuration memory, while enabling these desirable characteristics, also creates a reliability hazard, since RAM cells are susceptible to Single Event Upsets (SEUs). The usual approach is to use double or triple redundancy together with a correction mechanism such as periodic scrubbing. While scrubbing is an effective technique for removing SEU-induced errors, the repair of real-time systems presents specific challenges, such as avoiding failures caused by missed real-time deadlines. In this paper, a novel deadline-aware scrubbing scheme is proposed that dynamically chooses the scrubbing start position with negligible area cost. This scheme allows us to avoid missing real-time deadlines while maximizing the repair probability within a bounded repair time. Our approach reduces the failure rate, considering the probability of missing deadlines due to faults, by 33.39% on average, with an average area cost of 1.23%.
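A minimal Python sketch of the start-position choice (frame counts, window size, and critical-frame layout are all hypothetical, not the paper's parameters): given which configuration frames the task with the nearest deadline depends on, and how many frames the scrubber can visit before that deadline, start scrubbing where the repair window covers the most deadline-critical frames.

```python
# A toy deadline-aware choice of the scrubbing start position.

N_FRAMES = 100                  # configuration frames, scrubbed cyclically
frames_per_window = 30          # frames visitable before the next deadline

# Frames used by the task whose deadline comes next (hypothetical layout).
critical = set(range(45, 70))

def covered(start):
    # Frames visited in one repair window, wrapping around the memory.
    visited = {(start + i) % N_FRAMES for i in range(frames_per_window)}
    return len(visited & critical)

best = max(range(N_FRAMES), key=covered)
print(f"start at frame {best}, "
      f"repairs {covered(best)}/{len(critical)} critical frames")
```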
The proper mapping of an application onto a multi-core platform and the scheduling of its tasks are key to achieving the maximum possible performance. In this paper, a novel hybrid approach based on integrating the Logic-Based Benders Decomposition (LBBD) principle with a pure Integer Linear Programming (ILP) model is introduced for mapping complex applications, described by Directed Acyclic Graphs (DAGs), onto platforms consisting of heterogeneous cores. The LBBD approach combines two optimization techniques with complementary strengths, namely ILP and Constraint Programming (CP), and is employed as a cut generation scheme. The generated constraints are used by the ILP model to cut possible assignment combinations, aiming to improve the solution or to prove the optimality of the best one found. The introduced approach was applied both to synthetic DAGs and to DAGs derived from real applications. With the proposed approach, some problems were solved optimally that could not be solved within a time limit of two hours by either of the above methods alone (ILP or LBBD), while the overall solution time was also significantly decreased.
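The LBBD interaction pattern is easy to see in a toy Python sketch (stand-in master and subproblem with made-up data; the paper's actual models are a full ILP and a CP scheduler): the master proposes a cheapest assignment, the subproblem checks schedulability, and each infeasible proposal comes back to the master as a cut.

```python
# A toy LBBD loop: master assignment, feasibility subproblem, cuts.
from itertools import product

tasks = {"t0": 4, "t1": 3, "t2": 5}          # execution times
cores, capacity = ["c0", "c1"], 8            # per-core schedulable load
cost = {"c0": 1, "c1": 2}                    # e.g., energy weight per core

def master(cuts):
    # Stand-in for the ILP master: cheapest assignment not excluded by a cut.
    best = None
    for combo in product(cores, repeat=len(tasks)):
        assign = dict(zip(tasks, combo))
        if assign in cuts:
            continue
        c = sum(cost[p] * tasks[t] for t, p in assign.items())
        if best is None or c < best[0]:
            best = (c, assign)
    return best[1] if best else None

def subproblem_feasible(assign):
    # Stand-in for the CP scheduler: here, a simple utilization test.
    return all(sum(tasks[t] for t in tasks if assign[t] == p) <= capacity
               for p in cores)

cuts = []
while (assign := master(cuts)) is not None:
    if subproblem_feasible(assign):
        print("optimal:", assign)
        break
    cuts.append(assign)                      # Benders cut: forbid this assignment
```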
Complex electronic systems include multiple power domains and drastically varying dynamic power consumption patterns, requiring the use of multiple power conversion and regulation units. High-frequency switching converters have been gaining prominence in the DC-DC converter market due to their high efficiency and smaller form factor. Unfortunately, they are also subject to higher process variations and faster in-field degradation, jeopardizing stable operation of the power supply. This paper presents a technique to track changes in the dynamic loop characteristics of DC-DC converters, without disturbing the normal mode of operation, using white-noise-based excitation and correlation. Using multiple points for injection and analysis, we show that the degraded part can be diagnosed so that remedial action can be taken. White noise excitation is generated via pseudo-random disturbances at the reference, load current, and PWM nodes of the converter, with the test signal energy spread over a wide bandwidth, below the converter noise and ripple floor. The impulse response is extracted by correlating the random input sequence with the output disturbance: the cross-correlation accumulates the desired behavior over time and pulls it above the noise floor of the measurement setup. A power converter, the LM27402, is used as the DUT for experimental verification. Experimental results show that the proposed technique can estimate the converter natural frequency and Q-factor within ±2.5% and ±0.7% error margins, respectively, over changes in load inductance and capacitance. For diagnosis purposes, the RDCR value, which is indicative of degradation in the inductor's quality factor, is estimated within ±2% error.
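The identification principle itself is compact enough to demonstrate numerically in Python (synthetic second-order dynamics and noise levels, not the LM27402 measurements): because a white pseudo-random sequence has an impulse-like autocorrelation, cross-correlating input and output recovers the impulse response even when the excitation sits far below the noise and ripple floor.

```python
# A toy demonstration of impulse-response extraction by cross-correlation.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = rng.choice([-1.0, 1.0], size=n) * 0.01     # tiny PRBS-like excitation

# Hypothetical underdamped 2nd-order loop dynamics (discretized).
t = np.arange(60)
h_true = np.exp(-0.08 * t) * np.sin(0.5 * t)

y = np.convolve(u, h_true)[:n]
y += rng.normal(scale=0.05, size=n)            # ripple/noise >> excitation

# Cross-correlation estimate: h[k] ~ sum_i u[i] * y[i+k] / sum_i u[i]^2,
# since E[u[i] u[j]] is (nearly) zero for i != j.
h_est = np.array([np.dot(u[: n - k], y[k:]) for k in range(60)])
h_est /= np.sum(u * u)
print("max abs error:", np.max(np.abs(h_est - h_true)))
```

Longer correlation windows average down the noise term, which is how the method pulls the response above the measurement noise floor.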
The pharmaceutical supply chain is the pathway through which prescription and over-the-counter (OTC) drugs are delivered from manufacturing sites to patients. Technological innovations, price fluctuations of raw materials, as well as tax, regulatory, and market demands are driving change and making the pharmaceutical supply chain more complex. Traditional supply chain management methods struggle to protect the pharmaceutical supply chain, maintain its integrity, enhance customer confidence, and aid regulators in tracking medicines. To develop effective measures that secure the pharmaceutical supply chain, it is important that the community is aware of the state-of-the-art capabilities available to supply chain owners and participants. In this article, we present a survey of existing hardware-enabled pharmaceutical supply chain security schemes and their limitations. We also highlight current challenges and point out future research directions. This survey should be of interest to government agencies, pharmaceutical companies, hospitals and pharmacies, and any other bodies who care about the provenance and authenticity of medicines and the integrity of the pharmaceutical supply chain.
Voltage assignment is a well-known technique for circuit design, and it has been applied successfully to reduce power consumption in classical 2D integrated circuits (ICs). Its use in the context of 3D ICs has not yet been fully explored, although reducing power in 3D designs is of crucial importance, e.g., to tackle the ever-present challenge of thermal management. In this paper, we investigate the effective and efficient partitioning of 3D designs into multiple voltage domains during the floorplanning step of physical design. In particular, we introduce, implement, and evaluate novel algorithms for the effective integration of voltage assignment into the inner floorplanning loops. Our algorithms are compatible not only with the traditional objectives of 2D floorplanning but also with the additional objectives and constraints of 3D designs, including the planning of through-silicon vias (TSVs) and the thermal management of stacked dies. The resulting 3D floorplanner is tested extensively on the GSRC benchmarks as well as on an augmented version of the IBM-HB+ benchmarks. The 3D floorplans are shown to achieve about 6% to 19% power savings while simultaneously reducing critical delays, which results in an improvement of the power-delay product of approximately 15% to 18%. Furthermore, our open-source, multi-objective floorplanning framework helps to ease the problem of thermal management as well, as we show empirically in our study.
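The core trade-off a voltage-assignment step evaluates can be sketched in Python (made-up module powers and delays, two voltage levels, exhaustive search; the paper's algorithms run this kind of evaluation inside the floorplanner's inner loop rather than by brute force): lower a module's voltage to save power unless doing so breaks the critical-path timing budget.

```python
# A toy voltage-assignment evaluation under a delay constraint.
from itertools import product

# name: (power@V_hi, power@V_lo, delay@V_hi, delay@V_lo), made-up numbers
modules = {
    "alu":   (10.0, 6.0, 2.0, 3.2),
    "cache": ( 8.0, 5.0, 1.5, 2.4),
    "noc":   ( 6.0, 4.0, 1.0, 1.6),
}
critical_path, max_delay = ["alu", "cache"], 5.0

best = None
for choice in product(("hi", "lo"), repeat=len(modules)):
    assign = dict(zip(modules, choice))
    power = sum(p_hi if assign[m] == "hi" else p_lo
                for m, (p_hi, p_lo, _, _) in modules.items())
    delay = sum(modules[m][2] if assign[m] == "hi" else modules[m][3]
                for m in critical_path)
    if delay <= max_delay and (best is None or power < best[0]):
        best = (power, assign)
print(best)   # lowest-power assignment that still meets timing
```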
There has been a surge of interest in Non-Volatile Memory (NVM) in recent years. With many advantages, such as density and power consumption, NVM is carving out a place in the memory hierarchy and may eventually change our view of computer architecture. Many NVMs have emerged, such as Magnetoresistive random access memory (MRAM), Phase-Change random access memory (PCM), Resistive random access memory (ReRAM), and Ferroelectric random access memory (FeRAM), each with its own particular properties and specific challenges. The scientific community has carried out a substantial amount of work on integrating these technologies into the memory hierarchy. As many companies are announcing the imminent mass production of NVMs, we think it is time to take a step back and discuss the body of literature on NVM integration. This paper surveys state-of-the-art work on integrating NVM into the memory hierarchy. Specifically, we introduce the four types of NVM, namely MRAM, PCM, ReRAM, and FeRAM, and investigate different ways of integrating them into the memory hierarchy from horizontal and vertical perspectives. Here, horizontal integration means that the new memory is placed at the same level as an existing one, while vertical integration means that the new memory is interleaved between two existing levels. In addition, we describe the challenges and opportunities of each NVM technology.