Research, while at Synopsys

Hardware-Software Co-Design of Embedded Reconfigurable Architectures (DAC 2000)

By Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, and J. Stockwood


In this paper we describe a new hardware/software partitioning approach for embedded reconfigurable architectures consisting of a general-purpose processor (CPU), a dynamically reconfigurable datapath (e.g. an FPGA), and a memory hierarchy. We have developed a framework called Nimble that automatically compiles system-level applications specified in C to executables on the target platform. A key component of this framework is a hardware/software partitioning algorithm that performs fine- grained partitioning (at loop and basic-block levels) of an application to execute on the combined CPU and datapath. The partitioning algorithm optimizes the global application execution ime, including the software and hardware execution times, communication  time  and  datapath  reconfiguration  time. Experimental results on real applications show that our algorithm is effective in rapidly finding close to optimal solutions. 

Note: due to copyright restrictions, I cannot provide a copy of this on-line, however it is available thru ACM

Research, while at Rice as a student

Local (Software) Cache Coherence

Cache Coherence Using Local Knowledge (thesis)


Hiding memory latency is critical in modern machines. Typically, machines have used cache and addressed the ensuing cache coherence problem with hardware or VM-based strategies that rely on global inter-cache communication. However, global communication limits scalability. "Local knowledge" coherence strategies, which use compile-time information to avoid run-time global communication, offer better scalability, but suffer additional cache misses. We develop a framework for understanding the relation of coherence strategies, previous and newly proposed. Within this framework, it is possible to define, independent of implementation considerations, an "ideal" local strategy with respect to cache hit rate. No local strategy could ever do better. For Fortran programs with readily analyzable subscripts, ideal local strategies achieve the same hit rates as global strategies.

We develop three new local coherence strategies, CTV, TS1, and TS', designed to ex-ploit minimal, aggressive, and reasonable hardware support, respectively. CTV is suitable for machines with no hardware assistance for cache coherence except the bare minimum of an exposed invalidate instruction. TS1 implements the abstract theorems of ideal local coherence as a concrete algorithm. Though the implementation is probably too expensive for a real implementation, TS1 is a vehicle for studying the limits of local coherence. TS' treats coherence over array sections as a graph coloring problem. So long as there are sufficient colors (realized as bits per cache line), TS' is an ideal local strategy. We found that four colors are adequate for many programs. When more colors are needed, TS' degrades gracefully. Its execution overheads are negligible and its hardware implementation costs moderate.

Our data shows that TS' has better hit rates than the best previous local strategy, time-stamping, for nearly all programs, and thus better expected performance. Our data also shows that TS' achieves hit rates equal to global strategies for analyzable programs, and nearly so for partially analyzable programs. We indirectly compared the performance of TS' and a particular VM-style global strategy. TS' has better expected performance on our test suite. For machines without global coherence hardware, local strategies are an effective approach for an important class of programs.

Automatic Software Cache Coherence Through Vectorization (ICS '92)

by Ervan Darnell, John Mellor-Crummey, and Ken Kennedy
On shared-memory multiprocessors, caches for each processor must be kept consistent, i.e. have the same view of main memory. This is expensive to maintain in hardware. For machines which provide neither cache coherence nor local hardware assitance, the compiler can produce programs which are guaranteed to be coherent. This paper describes an approach relying on the notion of 'vectorzing' invalidates and updates. Experiments show that it compares favorably to similar methods. Significant speed-ups were found on a BBN TC2000.

Cache Coherence Using Local Knowledge (Supercomputing '93)

by Ervan Darnell and Ken Kennedy
Coherence hardware has a global nature, that is each cache must communicate with each other at run-time to maintain coherence. Strategies for effecting this suffer significant time or storage overhead. Instead, compile-time directed decisions plus some local run-time cache knowledge gathered using special hardware can achieve hit rates nearly as good as global strategies. Local strategies suffer no network or contention delays to perform coherence. Additional storage cost is also minimal. This paper presents an algorithm that is ideal in the sense that no local strategy could ever achieve a higher hit rate, for a given level of compiler analysis.

PFC Enhancements (Parallel Fortran Converter)

Loop Fusion in PFC (Supercomputer Software Newsletter #17, CRPC)

This describes a code enhancement in PFC parallel code generator to fuse loops where logically possible, consistent with dependence patterns. This saves on synchronization overhead, exposes additional optimization possibilities, and improves cache utilization.

Sum Reduction in PFC (Supercomputer Software Newsletter #13, CRPC)

This describes an enhancement to the PFC vector code generator that expands the class of statements which can be recognized as special reduction idioms and what dependence patterns indicate those reductions.  

Other Papers

An Empirical Exploration of the Poincaré Model for Hyperbolic Geometry (Winter '93 Journal of Mathematics & Computer Education)

by Joel Castellanos, Joe Dan Austin, and Ervan Darnell
This paper describes the implementation and application to teaching of a program that allows the user to draw objects in a hyperbolic space and then ask questions about geometric properties. Many common Euclidean theorems are still true in surprising ways. Others are not applicable and thus the user is challenged to understand the theorems at a deeper level and appreciate proper notions of proof.


Examples, software, and additional information on above paper.