Home Page of Fall 2024 EE685

This is the home page for EE685. There is only one section, meeting TR 3:30-4:45PM in room 267 FPAT.

Although many course materials will be posted here, the course Canvas site will be the primary place for course announcements. You are expected to use Canvas and materials linked here. The best way to schedule office meetings, physical or virtual, is via email with "EE685" in the Subject line.

The precise course content of EE685 has varied quite a bit over the years. However, in general, it has tried to be a step beyond the material covered in the CPE380-CPE480 sequence, and so it will be this semester. What that means is a little different from past offerings because CPE380 and CPE480 have both seen major upgrades and internal restructurings over the past two years. In particular, CPE380 is now fairly Verilog intensive, and students who took CPE380 in previous years, or who took a similar course elsewhere, might not have that Verilog exposure. Thus, some of the material in this offering of EE685 overlaps the latest (Fall 2024) CPE380 coverage to ensure that students are comfortable with Verilog and the basic CPE380 concepts. Even in the review-like portions, this course introduces some more advanced concepts, and it quickly moves to broader and deeper material.

The list of reference materials below will grow dramatically as the course progresses. All links below the horizontal rule have yet to be updated for Fall 2024 and are likely to change significantly.

The syllabus, which has been updated for Fall 2024.
The introductory slides
An overview of Verilog
- Icarus Verilog (iverilog and vvp) is the primary tool we'll be using for compiling and simulating Verilog code. It is actually part of gEDA. Note that you can install it on Ubuntu Linux systems by simply selecting it in the Software Center or Synaptic; it is a standard part of the Ubuntu distribution and has also been ported to Windows and other platforms.
- Icarus Verilog Simulator CGI Interface created by Professor Dietz and used for Verilog code throughout this course
- EDA playground is an alternative WWW interface for running Icarus Verilog and various other tools, including some commercial simulators. You must log in to use it, but registration is free.
- ASIC World has a multitude of really nicely prepared materials showing how to use Verilog
A really fast review of CPE380/480 up to the point of a simple pipelined Verilog implementation of MIPS
Slides introducing a simple MIPS-based SIMD instruction set
- Activity Counter Implementation Of Enable Logic (PDF). This paper describes a clever method for tracking nested SIMD enable/disable state without using a bit stack; the MIPS-based SIMD uses this trick.
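As a rough illustration of the activity-counter idea, here is a minimal Python sketch. It assumes one common formulation: each processing element (PE) keeps a small counter that is 0 while the PE is enabled and otherwise records how many nesting levels deep the PE was when it was disabled. The function names (`simd_where`, `simd_elsewhere`, `simd_endwhere`) are hypothetical, and the details in the linked paper and in the MIPS-based SIMD may differ.

```python
# Sketch of per-PE activity counters (an assumed formulation, not the
# paper's exact logic). counters[i] == 0 means PE i is enabled.

def simd_where(counters, cond):
    """Enter a nested conditional; cond[i] is PE i's test result."""
    return [
        c + 1 if c > 0          # already disabled: just note one more level
        else (0 if ok else 1)   # enabled: stay on, or turn off at this level
        for c, ok in zip(counters, cond)
    ]

def simd_elsewhere(counters):
    """Switch to the else part: swap PEs enabled (0) with PEs disabled here (1)."""
    return [1 if c == 0 else (0 if c == 1 else c) for c in counters]

def simd_endwhere(counters):
    """Leave a conditional: disabled PEs move one level closer to enabled."""
    return [c - 1 if c > 0 else 0 for c in counters]

def enabled(counters):
    """Return the enable bit for each PE."""
    return [c == 0 for c in counters]
```

With four PEs starting at `[0, 0, 0, 0]`, an outer test of `[True, True, False, False]` followed by a nested test of `[True, False, True, False]` leaves only PE 0 enabled, and two `simd_endwhere` calls restore all four. No per-PE stack of enable bits is needed, only a counter wide enough for the maximum nesting depth.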
Better pipeline management slides
ALU stuff (still being extended)
- About transcendental functions: CORDIC (COordinate Rotation DIgital Computer) algorithms were popular in calculators for evaluating transcendental functions because, compared to truncated Taylor series, they don't require lots of hardware: each iteration needs only shifts, adds, and a small table lookup. They fell out of favor as fast floating-point hardware became more common. The catch is, lots of FPGAs can't easily provide lots of fast floating-point hardware, so CORDIC algorithms have become popular for use in FPGAs. Here are a couple of easy-to-understand explanations of CORDIC: CORDIC for Dummies and Implementing Cordic Algorithms.
- Unums and Posits are described in Beating Floating Point at its Own Game: Posit Arithmetic, and there are also slides Beyond Floating Point: Next-Generation Computer Arithmetic giving a simpler overview
- The bfp - Beyond Floating Point C/C++ library implements Posits
- I describe a table-lookup implementation of 8-bit posit arithmetic in my Gr8BOnd processor implementation materials originally made for CPE480
- My implementation of 16-bit LNS (log number system) arithmetic
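To make the CORDIC item above more concrete, here is a minimal fixed-point sketch of rotation-mode CORDIC computing sine and cosine. It is written in Python only for readability; the loop body uses nothing but shifts, adds, and a table lookup, which is what makes the method cheap in hardware. The constants `FRAC_BITS` and `ITERATIONS` and the function name `cordic_sin_cos` are illustrative assumptions, not taken from the linked materials.

```python
import math

FRAC_BITS = 16            # assumed fixed-point format: 16 fractional bits
ITERATIONS = 16           # one table entry and one shift amount per iteration
ONE = 1 << FRAC_BITS

# Precomputed table of atan(2^-i), scaled to fixed point.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector; pre-scale by the inverse gain.
_gain = 1.0
for _i in range(ITERATIONS):
    _gain *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
K_FIXED = round(ONE / _gain)

def cordic_sin_cos(angle):
    """Return (sin, cos) of angle in radians, for |angle| <= pi/2."""
    x, y, z = K_FIXED, 0, round(angle * ONE)
    for i in range(ITERATIONS):
        if z >= 0:   # rotate counterclockwise by atan(2^-i)
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:        # rotate clockwise by atan(2^-i)
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y / ONE, x / ONE
```

Calling `cordic_sin_cos(1.0)` gives values close to `math.sin(1.0)` and `math.cos(1.0)`; accuracy is limited by the number of iterations and fractional bits. In a Verilog version, the table would be a small ROM and the final divisions by `ONE` would simply be a fixed-point interpretation of the result.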
Memory system overview slides; note that we'll talk more about memory coherence a bit later when discussing parallel machines
Scalable supercomputing overview slides
Papers describing basic (pronounced "old") SIMD (Single Instruction stream, Multiple Data stream) stuff. Notice that traditional SIMD is often bit-serial and extremely simple per processing element.
- Architecture of a massively parallel processor (PDF). This paper describes Ken Batcher's SIMD MPP design at Goodyear Aerospace.
- DAP -- a distributed array processor (PDF). This paper describes the ICL DAP, another early SIMD machine.
- Thinking Machines CM-2 (PDF). A (relatively late) version of the "Connection Machine Model CM-2 Technical Summary, Version 6.0, November 1990." This includes a description of the (CM-200) floating-point hardware added to the design.
- KY Architecture Nanocontrollers. This was really introduced in Much Ado about Almost Nothing: Compilation for Nanocontrollers (slides, full paper). This is essentially answering the question: how simple can a SIMD PE be?
- A Quantum-Inspired Model For Bit-Serial SIMD-Parallel Computation (slides, full paper). There is also the Tangled processor design from CPE480 Fall 2020. For now, we'll ignore the quantum stuff; it's a bit-serial SIMD with a few interesting twists.
Basic SWAR Architecture & Concepts
- One of the first talks on the concepts of SWAR was Multimedia Extensions For Microprocessors: SIMD Within A Register (HTML, PDF), which I originally presented in February 1997 at Purdue University. The HTML is a little ugly, but this is the original HTML, and the server it was on supported different server-side processing....
- One of the best generic descriptions of the concepts of SWAR was Compiling for SIMD within a Register (PDF), which is linked from Springer-Verlag using UK's EZProxy access
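As a small taste of what SWAR means in practice, the sketch below adds four unsigned 8-bit fields packed into one 32-bit word using only ordinary full-width integer operations, without letting carries leak between fields. This masking trick is a standard SWAR idiom; the function name `swar_add8` is just illustrative, and Python integers stand in for a 32-bit register.

```python
HIGH = 0x80808080   # most significant bit of each 8-bit field
LOW = 0x7F7F7F7F    # the other seven bits of each field

def swar_add8(x, y):
    """Add four packed 8-bit fields of x and y, each modulo 256."""
    # Add the low 7 bits of every field; the largest field sum is 0xFE,
    # so a carry can reach bit 7 of its own field but never the next field.
    low_sum = (x & LOW) + (y & LOW)
    # The top bit of each field is the XOR of both operands' top bits and
    # the carry into bit 7, which is already sitting in low_sum.
    return (low_sum ^ ((x ^ y) & HIGH)) & 0xFFFFFFFF

def reference_add8(x, y):
    """Field-at-a-time version, used only to check swar_add8."""
    result = 0
    for shift in (0, 8, 16, 24):
        field = (((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)) & 0xFF
        result |= field << shift
    return result
```

For example, `swar_add8(0x01FF7F80, 0x01018080)` yields `0x0200FF00`: the fields `FF + 01` and `80 + 80` both wrap to `00` without disturbing their neighbors. The reference version takes four field extractions and adds, while the SWAR version uses a handful of full-width operations regardless of field count.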
An Introduction to Modern GPU Architecture, a very nice, if old, set of overview slides from NVIDIA
Mark Harris 2007 slides on reduction optimization
It is useful to note that there is now even better efficiency possible using warp shuffle, and lots of optimized functions are now available using CUB
The atomic primitives are described in this section of the CUDA-C programming guide. Here are slides from NVIDIA overviewing their use.
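The warp-shuffle reduction mentioned above can be modeled in a few lines. The sketch below is a plain Python simulation, not CUDA: `shfl_down` is a hypothetical stand-in that mimics how a lane reads the register of the lane `delta` positions higher (as CUDA's `__shfl_down_sync` does), and out-of-range lanes are assumed to get their own value back. It assumes the number of lanes is a power of two.

```python
WARP_SIZE = 32

def shfl_down(values, delta):
    """Model: lane i reads lane i + delta; out-of-range lanes read themselves."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]

def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    vals = list(values)
    offset = len(vals) // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds in lockstep, as a SIMD warp would.
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]
```

For 32 lanes this takes five shuffle-and-add steps (offsets 16, 8, 4, 2, 1) and uses no shared memory, which is where the efficiency gain over the older shared-memory reductions comes from. Only lane 0's result is meaningful at the end; the other lanes hold partial sums.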
A bit about AIK (Assembler Interpreter from Kentucky). This assembler-interpreter was created explicitly for EE480 by H. Dietz. It was built using the PCCTS/Antlr tools (which were created by Dietz's group back in the early 1990s). Given a very concise specification of an assembly language, it interpretively implements a relatively full-featured assembler.
- AIK user manual: PDF
- AIK CGI executable: WWW form interface
- AIK specification of EE380 MIPS: aikmips
Papers describing MIMD (Multiple Instruction stream, Multiple Data stream) stuff.
- We just completed SIMD, so here's how to make MIMD code execute on SIMD (GPU) hardware. I've done a lot of this; see here. Slides overviewing how this works are also available.
- Henry G. Dietz and Thomas Schwederski, Extending Static Synchronization Beyond SIMD and VLIW, Purdue University School of Electrical Engineering, Technical Report TR-EE 88-25, June 1988. (local PDF, PDF).
- Repetition Filter Memory in CHoPP: A. Klappholz. (1981). IMPROVED DESIGN FOR A STOCHASTICALLY CONFLICT-FREE MEMORY/INTERCONNECTION SYSTEM. 443-448. Paper presented at Conf Rec Asilomar Conf Circuits Syst Comput 14th, Pacific Grove, CA, USA. (still looking for a copy of this or another relevant article...)
- Fetch-&-Add in the NYU Ultracomputer: A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer -- Designing an MIMD Shared Memory Parallel Computer" in IEEE Transactions on Computers, vol. 32, no. 02, pp. 175-189, 1983. doi: 10.1109/TC.1983.1676201 (URL, local copy)
- "An Overview of the NYU Ultracomputer Project (1986)" (PDF) is a better, but more obscure, reference
- Explanation of the "Hot Spot" problem for RP3: G. F. Pfister and V. A. Norton, "'Hot spot' contention and combining in multistage interconnection networks," in IEEE Transactions on Computers, vol. C-34, no. 10, pp. 943-948, Oct. 1985. (URL, local copy)
- Memory consistency models: "Shared Memory Consistency Models: A Tutorial" (PDF) -- Sarita Adve has done quite a few versions of this sort of description
- Modern atomic memory access instructions: AMD64 atomic instructions
- Transactional Memory has been a hot idea for quite a while. Intel's Haswell processors incorporate a hardware implementation described in chapter 8 of this PDF (locally, PDF); but there were (still are) problems.
- Replicated/Distributed Shared Memory: A very odd one is implemented in AFAPI as Replicated Shared Memory
- Classical DSM: The best known is Treadmarks, out of Rice University. One of the latest is DEX: Scaling Applications Beyond Machine Boundaries, which is part of Popcorn Linux
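The Fetch-&-Add and "Hot Spot" entries above both revolve around combining requests inside the interconnection network. The sketch below is a simplified Python model of that idea: two fetch-and-add requests to the same address meet at a switch, which forwards a single request with the summed increment and then fabricates both replies. The class `Memory` and the function `combining_switch` are hypothetical names for illustration; real switches also need wait buffers and handle many more cases.

```python
class Memory:
    """Toy shared memory that counts how many requests actually reach it."""

    def __init__(self):
        self.cells = {}
        self.requests = 0

    def fetch_and_add(self, addr, inc):
        """Atomically return the old value at addr and add inc to it."""
        self.requests += 1
        old = self.cells.get(addr, 0)
        self.cells[addr] = old + inc
        return old

def combining_switch(memory, addr, inc_a, inc_b):
    """Combine F&A(addr, inc_a) and F&A(addr, inc_b) into one memory request."""
    old = memory.fetch_and_add(addr, inc_a + inc_b)
    # Replies look as if request A was serviced first and request B second.
    return old, old + inc_a
```

If the cell holds 10 and two processors issue increments of 3 and 5, they receive 10 and 13, the cell ends at 18, and memory sees one request instead of two. Applied at every stage of a multistage network, this is how many simultaneous requests to one hot location can be reduced to a single memory access.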

Course Staff

Professor Hank Dietz would normally be in the Davis Marksbury Building; see his home page for complete contact info. Regular Zoom office hour times will soon be listed there. He has an "open-door" policy that whenever his door is open and he's not busy with someone else, he's available -- and yup, there really is a slow-update live camera in his office so you can check. However, during the pandemic things are far less certain, and you should wear a mask if meeting in his office. The best method to contact him is to email hankd@engr.uky.edu using "EE685" in the subject line for anything related to this course. If appropriate, individual Zoom meetings also can be scheduled via email.

Digital Computer Structure.