DE2 Verilog examples
ECE 576 Cornell University

I used these examples to help teach myself Verilog. They are in the order I did them. The later examples are more fluent.
List of projects included on this page:

  1. Simple FPGA i/o
  2. External SRAM interface
  3. Kraken 16-bit cpu
  4. A tiny, single-accumulator CPU, the uP3
  5. VGA interface: color generation, diffusion-limited aggregation
  6. Audio Codec interface: DAC output and ADC input
  7. M4K memory block timing test

  1. FPGA I/O
    This simple example defines:
    The example was built mostly to understand the FPGA I/O pin assignments and the compilation/synthesis procedure. The whole QuartusII project is zipped here.
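A minimal sketch of this kind of I/O exercise, assuming the port names used in the standard DE2 pin-assignment file (SW, KEY, LEDR, LEDG) and active-low pushbuttons. The module name and the particular wiring are illustrative and are not taken from the zipped project.

```verilog
// Hypothetical top-level: mirrors switches and pushbuttons onto LEDs.
module io_sketch (
    input  wire [17:0] SW,    // toggle switches
    input  wire [3:0]  KEY,   // pushbuttons, assumed active-low
    output wire [17:0] LEDR,  // red LEDs
    output wire [3:0]  LEDG   // green LEDs (only four are driven here)
);
    assign LEDR = SW;     // each red LED follows its switch
    assign LEDG = ~KEY;   // a green LED lights while its button is held down
endmodule
```

The port names only reach physical pins once the matching pin assignments are imported into the project, which is the part of the flow this example was meant to exercise.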
  2. External SRAM interface
    This example exercises external 61LV25616 SRAM by:

  3. Kraken 16-bit cpu
    This example is a simple 16-instruction ISA cpu with LED and switch i/o. The implemented datapath and timing diagram are useful for understanding the Verilog. There is a picture of the board displaying instruction address PC=02, which contains 16'h8104, the instruction LI r1,4 (load-immediate register 1 with value 4). This cpu is mostly intended for me to teach myself Verilog in an Altera context.
    Features include:
    Program:
       assembler        instruction memory
       LI r0, 1         8001 ;need to NOP first inst out of reset
       LI r0, 1         8001
       LI r1, 4         8104
       SUB r1, r1, r0   1110
       BNZ r1, -1       C1FF ;PC update timing implies that this jump is to the SUB
       JMP -3           E0FD ;This jump is to the second LI
    A short mpeg shows this program executing. The finger entering the frame from the lower right is running the clock. The blinking green LED is illuminated during the FETCH state. The left-most 2 digit 7-seg display is showing the PC. The 4 digit 7-seg display is showing the instruction being fetched/executed. The program loops through the subtract 4 times, then jumps back, reloads the counters and down-counts again.
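The four encodings in the program listing are consistent with a fixed field layout: opcode in bits [15:12], destination register in [11:8], and either two source registers in [7:4] and [3:0] or an 8-bit signed immediate in [7:0]. The sketch below splits an instruction word that way. The layout is inferred from the listing only, and the module and signal names are hypothetical, not those of the Kraken source.

```verilog
// Field splitter for a 16-bit instruction word (layout assumed from the listing).
module kraken_fields (
    input  wire [15:0]       instr,
    output wire [3:0]        opcode,  // e.g. 8 = LI, 1 = SUB, C = BNZ, E = JMP
    output wire [3:0]        rd,      // destination (or tested) register
    output wire [3:0]        rs,      // first source register (register forms)
    output wire [3:0]        rt,      // second source register (register forms)
    output wire signed [7:0] imm      // signed immediate / branch offset
);
    assign opcode = instr[15:12];
    assign rd     = instr[11:8];
    assign rs     = instr[7:4];
    assign rt     = instr[3:0];
    assign imm    = instr[7:0];       // overlaps rs/rt; which is used depends on opcode
endmodule
```

With this layout 16'h8104 splits into opcode 8, rd 1 and immediate 4, and 16'hC1FF yields an immediate of -1, matching LI r1,4 and BNZ r1,-1 in the listing.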

    A possible variant is a simple cpu with i/o ports and a small ISA aimed at DSP. The implemented datapath and timing diagram are useful to understand the Verilog.

    Features might include:

  4. A tiny CPU, the uP3 (from Hamblen, Rapid prototyping of digital systems--SOPC Edition, chapter 9, Springer 2008)
    This example is a simple, one-accumulator cpu which could be hacked for parallel processing, since it requires only one M4k block for data/program and uses only a few hundred logic elements. Perhaps by adding a few i/o ports, a multiply instruction, a limiter, and a few more instructions you could have a usable, programmable DSP block. The M4k block mif file is loaded with the machine code and initial data. This version exposes internal cpu busses for debugging, but a usable version would not (see modified version below). The assembler test program uses the output port to count on 4 digits of the hex LED display on the DE2 board. The actual assembler was written in matlab. The assembler input file and the resulting mif file are shown below. The first two digits of each memory word are the opcode (e.g. at location 00 the LOAD is 02); the second two digits are the address (hex 10 in that case) of the value to be loaded into the accumulator.
    assembler source (first listing), followed by the resulting MIF file (second listing, starting at DEPTH):
    ;define section
    define
     	LEDs 00
    
    ; data section
    data 16	; base address
    	; name length value(optional)
     	initA	1 
     	incr 	1	1
     	outval 	1 
    
    ;code section
    code
    ; label opcode	address
    init:	load 	initA
    loop:	add 	incr
         	jneg 	skip
         	jump 	loop
    skip:	load 	outval
         	add 	incr
         	out 	LEDs
         	store 	outval
         	jump 	init
    DEPTH = 256;
    WIDTH = 16;
      
    ADDRESS_RADIX = HEX;
    DATA_RADIX = HEX;
       
    CONTENT
    BEGIN
    [00..FF]	:	0000;
    00	:	0210;	% init load initA % 
    01	:	0011;	% loop add incr % 
    02	:	0404;	%  jneg skip % 
    03	:	0301;	%  jump loop % 
    04	:	0212;	% skip load outval % 
    05	:	0011;	%  add incr % 
    06	:	0500;	%  out LEDs % 
    07	:	0112;	%  store outval % 
    08	:	0300;	%  jump init % 
    10	:	0000;	% initA  % 
    11	:	0001;	% incr  % 
    12	:	0000;	% outval  % 
    END ;	
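To make the opcode/address split concrete, here is a behavioral sketch of a one-accumulator machine that executes the listing above. The opcode values are read off the MIF words (00 add, 01 store, 02 load, 03 jump, 04 jneg, 05 out). Everything else is an assumption for illustration: one instruction per clock, a plain register array instead of an M4K block, a synchronous active-high reset, and an out instruction that ignores its port address. It is not the uP3 source.

```verilog
// Behavioral sketch only; not the uP3 code from Hamblen or from the project zip.
module up3_model (
    input  wire        clk,
    input  wire        reset,     // synchronous, active-high (assumption)
    output reg  [15:0] out_port   // value written by the "out" instruction
);
    // Opcodes as they appear in the upper byte of the MIF words.
    localparam [7:0] ADD   = 8'h00,
                     STORE = 8'h01,
                     LOAD  = 8'h02,
                     JUMP  = 8'h03,
                     JNEG  = 8'h04,
                     OUT   = 8'h05;

    reg [15:0] mem [0:255];       // shared program/data memory
    reg [15:0] ac;                // accumulator
    reg [7:0]  pc;                // program counter

    wire [15:0] ir      = mem[pc];
    wire [7:0]  opcode  = ir[15:8];
    wire [7:0]  address = ir[7:0];

    integer i;
    initial begin
        for (i = 0; i < 256; i = i + 1) mem[i] = 16'h0000;
        mem[8'h00] = 16'h0210;    // init: load  initA
        mem[8'h01] = 16'h0011;    // loop: add   incr
        mem[8'h02] = 16'h0404;    //       jneg  skip
        mem[8'h03] = 16'h0301;    //       jump  loop
        mem[8'h04] = 16'h0212;    // skip: load  outval
        mem[8'h05] = 16'h0011;    //       add   incr
        mem[8'h06] = 16'h0500;    //       out   LEDs
        mem[8'h07] = 16'h0112;    //       store outval
        mem[8'h08] = 16'h0300;    //       jump  init
        mem[8'h11] = 16'h0001;    // incr = 1 (initA and outval start at 0)
    end

    always @(posedge clk) begin
        if (reset) begin
            pc       <= 8'h00;
            ac       <= 16'h0000;
            out_port <= 16'h0000;
        end else begin
            pc <= pc + 8'd1;                      // default: next instruction
            case (opcode)
                ADD:   ac           <= ac + mem[address];
                STORE: mem[address] <= ac;
                LOAD:  ac           <= mem[address];
                JUMP:  pc           <= address;   // overrides the default increment
                JNEG:  if (ac[15]) pc <= address; // branch when accumulator is negative
                OUT:   out_port     <= ac;
                default: ;                        // unknown opcodes act as no-ops
            endcase
        end
    end
endmodule
```

In this model the inner loop keeps adding 1 until the accumulator's sign bit sets, so the loop acts as a delay between successive increments of the value sent to the output port.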
    

    The entire project (including mif file) is here. Adding a PLL to speed up CLOCK_50 allows the uP3 to run at 150 MHz with no timing errors reported. Running at 200 MHz caused the timing analyser to report errors, but the cpu still ran. Running at 250 MHz caused the cpu to fail.

    A slightly modified version has two cpus instantiated, running two different assembler codes (for cpu0 and cpu1). The cpu1 code increments the hex display 4 times as fast as the code for cpu0. The mif file names for the two separate program memory contents are specified at the top level using a separate defparam module as shown below. The entire project is zipped here.
    module annotate;
    defparam
    	DE2_TOP.cpu0.altsyncram_component.init_file = "TestPgm0.mif",
    	DE2_TOP.cpu1.altsyncram_component.init_file = "TestPgm1.mif";
    endmodule
    Another slightly modified version has three cpus instantiated, running three different assembler codes. The cpu1 code increments the hex display 4 times as fast as the code for cpu0. The cpu2 runs a copy of the same code as cpu0, but uses one bit of its output to alternately hold each of the other two processors in reset, so that the two cpu counts alternate as shown in the video.

  5. VGA examples
    1. The first example displays a 320x240 image on a 640x480 raster using external SRAM. The color depth is 12 bits/pixel (4 bits/primary/pixel). When first powered up, SRAM contains random bits. Pressing KEY1 writes a 20x15 grid of colors to memory. Holding KEY2 while pressing KEY1 writes a single color to SRAM, as determined by the upper 12 bits of switches. SW[15:12] is red intensity, SW[11:8] is green, SW[7:4] is blue. The VGA driver, PLL, and reset controller from the DE2 CDROM are necessary to compile this example. All the files are in this zip. An image is below. There is a dim scan bar across the middle of the frame caused by the digital camera.
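One way to show a 320x240 image on a 640x480 raster is to drop the low bit of each screen coordinate, so every stored pixel covers a 2x2 block. The sketch below does that and expands a 12-bit color into three DAC channels. Several things here are assumptions rather than details of the zipped project: one pixel per 16-bit SRAM word, the word layout (red in [15:12], green in [11:8], blue in [7:4], mirroring the switch assignment), row-major addressing, 10-bit DAC inputs, and all module and signal names.

```verilog
// Hypothetical address/color mapping for a pixel-doubled 320x240 frame buffer.
module vga_sram_map (
    input  wire [9:0]  vga_x,      // visible pixel column, 0..639
    input  wire [9:0]  vga_y,      // visible pixel row, 0..479
    input  wire [15:0] sram_data,  // word read back from SRAM
    output wire [17:0] sram_addr,  // word address into the 256K x 16 SRAM
    output wire [9:0]  vga_r,
    output wire [9:0]  vga_g,
    output wire [9:0]  vga_b
);
    wire [8:0] col = vga_x[9:1];   // 0..319, each column shown twice
    wire [7:0] row = vga_y[8:1];   // 0..239, each row shown twice

    // Row-major address: row * 320 + col (at most 76799, well inside 18 bits).
    assign sram_addr = {10'b0, row} * 18'd320 + {9'b0, col};

    // Place each 4-bit primary in the top bits of its DAC channel.
    assign vga_r = {sram_data[15:12], 6'b0};
    assign vga_g = {sram_data[11:8],  6'b0};
    assign vga_b = {sram_data[7:4],   6'b0};
endmodule
```

Padding with zeros means full intensity reaches 960 of 1023; replicating the top bits into the low bits would reach full scale at the cost of slightly more wiring.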
    2. The second example displays a 640x480 image on a 640x480 raster with a color depth of 8 bits/pixel. The color map is shown below. There are 16 levels of green (vertically), 4 levels of red and 4 levels of blue. Both red and blue increase in intensity from left to right (except for the last four columns on the right where there is no blue). In the first 4 columns, blue intensity is zero, becoming full intensity in columns 13-16. You could build a color lookup table to produce more distinct colors than the linear set displayed here.
    3. The third VGA example generates a 320x240 diffusion-limited-aggregation (DLA). A DLA is a clump formed by sticky particles adhering to an existing structure. In this design, we start with one pixel at the center of the screen and allow a random walker to bounce around the screen until it hits the pixel at the center. It then sticks and a new walker is started randomly at one of the 4 corners of the screen. The random number generators for x and y steps are XOR feedback shift registers (see also Hamblen, Appendix A). The VGA driver, PLL, and reset controller from the DE2 CDROM are necessary to compile this example. All the files are in this zip. A short mpg (1.3 Mbyte) shows the initial stages of the simulation. The first image below is at an earlier time in the simulation than the second image. When the structure is larger, there is a bias to grow toward the corners because that is where new walkers were released. The program is structured as a state machine which processes walker updates only during vertical or horizontal sync intervals when external SRAM is not needed to update the screen. Note that you must push KEY0 to start the state machine.


      A minor change in the design results in colors. The 12-bit color vector (see VGA example 1 above) was just decremented for every new walker. Since the ordering is high bits for red, mid bits for green, low bits for blue, the color starts at white, fades to yellow, then red, then cycles in a complicated fashion.



      Another minor modification of the design results in a "forest". In this case, the entire bottom of the screen was the starting seed and the walker release sites were uniformly distributed across most of the top of the screen. The second image below shows the simulation after it has reached the top of the screen. At this point, every walker is immediately immobilized and growth stops.

      This design version has several states merged to take advantage of hardware parallelism. Memory access is, of course, serial. The walker release sites were modified to cover approximately the top two-thirds of the screen.
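The XOR feedback shift registers used for the DLA walker steps can be sketched as below. This is a generic 16-bit register, not the generator from the project: the tap positions (16, 15, 13, 4, a set commonly tabulated for 16-bit registers), the seed, and the way a step is derived from the low bit are all assumptions.

```verilog
// Hypothetical 16-bit XOR-feedback shift register producing +1/-1 steps.
module lfsr16 (
    input  wire              clk,
    input  wire              reset,  // synchronous, loads a non-zero seed
    output reg  [15:0]       state,
    output wire signed [1:0] step    // +1 or -1, chosen by the low state bit
);
    // XOR of the tapped bits becomes the new low bit.
    wire feedback = state[15] ^ state[14] ^ state[12] ^ state[3];

    always @(posedge clk) begin
        if (reset)
            state <= 16'hACE1;                 // any non-zero seed works
        else
            state <= {state[14:0], feedback};  // shift left, insert feedback
    end

    assign step = state[0] ? 2'b01 : 2'b11;    // 2'b11 is -1 in two's complement
endmodule
```

An all-zero state would lock up an XOR-feedback register, which is why the reset loads a non-zero seed. Two such registers with different seeds could supply independent x and y steps.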
  6. Using the Audio Codec
    DAC only:
    The audio codec supports ADC conversion of microphone or line inputs into the FPGA, and DAC conversion of digital audio from the FPGA to line out. This example shows how to set up playback from the FPGA to line out. Playback is supported from flashRAM, SDRAM, SRAM or by direct on-the-fly synthesis. This example uses direct digital synthesis (DDS) of a sine wave. The DDS sinewave ROM table is computed by a matlab program. In this mode, the codec requires 16-bit, 2's complement samples for each channel. The matlab program actually generates 15-bit samples because the gain of the headphone output was turned up too high and caused clipping.
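A minimal sketch of the DDS idea: a phase accumulator advances by a fixed increment every sample, and its top bits index a sine table. The table length (256), the 32-bit accumulator, and all names are assumptions. The table is filled with the simulator's $sin function purely to keep the sketch self-contained, whereas the project loads a table computed in matlab. The amplitude is limited to 15 bits, as described above.

```verilog
// Hypothetical DDS sine generator (simulation-oriented table initialisation).
module dds_sine #(
    parameter PHASE_BITS = 32
) (
    input  wire                  clk,        // one tick per output sample
    input  wire                  reset,      // synchronous, active-high
    input  wire [PHASE_BITS-1:0] increment,  // f_out = increment * f_clk / 2^PHASE_BITS
    output reg  signed [15:0]    sample      // 2's complement output sample
);
    reg [PHASE_BITS-1:0] phase;
    reg signed [15:0]    sine_rom [0:255];

    integer i;
    initial begin
        // One full cycle, peak value 16383 so samples fit in 15 bits.
        for (i = 0; i < 256; i = i + 1)
            sine_rom[i] = $rtoi(16383.0 * $sin(6.283185307179586 * i / 256.0));
    end

    always @(posedge clk) begin
        if (reset) begin
            phase  <= 0;
            sample <= 0;
        end else begin
            phase  <= phase + increment;                  // wraps once per cycle
            sample <= sine_rom[phase[PHASE_BITS-1 -: 8]]; // top 8 bits index the table
        end
    end
endmodule
```

For synthesis the table would normally come from a precomputed file instead, for example a mif file or a hex file read with $readmemh.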

    The logical structure of the hardware driver:

    ADC and DAC:
    This example uses a modified interface to loop the stereo ADC input through a digital filter and back to the output DAC. The whole project is zipped here. The project includes a SignalTap debug interface to show ADC data input. In the SignalTap data-capture shown below, the ADCLRCK signals right-channel when low and left-channel when high.

    The top_level module was modified slightly to simplify the interface to the new Audio_DAC_ADC module. The I2C_AV_config module LUT was modified to turn on the ADC with line-input. Note that one word of dummy data must be sent across the I2C interface before the first actual setup parameter. The central part of the Audio_DAC_ADC module is shown below. Data is clocked in/out on the negative edge of AUD_BCLK. The 16-bit, serial data is 2's complement, MSB first. The LRCK_1X signal represents the locally generated ADCLRCK.
    always@(negedge oAUD_BCK or negedge iRST_N)
    begin
    	if(!iRST_N) SEL_Cont <= 0;
    	else
    	begin
    		SEL_Cont <= SEL_Cont+1; //BIT SELECTOR: 4 bit counter, so it wraps at 16
    		if (LRCK_1X) //ADCLRCK
    			AUD_inL[~(SEL_Cont)] <= iAUD_ADCDAT;
    		else
    			AUD_inR[~(SEL_Cont)] <= iAUD_ADCDAT;
    	end
    end
    // output the DAC bit-stream						
    assign oAUD_DATA = (LRCK_1X)? AUD_outL[~SEL_Cont]: AUD_outR[~SEL_Cont] ;														
    // Filter the input sample and register output	
    always@(negedge LRCK_1X ) //oAUD_LRCK
    begin
    	// 1-pole lowpass at about 200 Hz. The >>> operator is SIGNED shift right
    	AUD_outR <= (AUD_inR>>>6) + AUD_outR - (AUD_outR>>>6); 
     	// just pass through left channel
    	AUD_outL <= AUD_inL ; 
    end 
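Written as a difference equation, the right-channel update in the last always block is a one-pole lowpass with coefficient 2^-6 (ignoring truncation in the shifts). The sample rate f_s is not stated in this section, so the cutoff is left in terms of f_s; the small-coefficient approximation is a standard result and is not taken from the project files.

```latex
\begin{align}
  y[n] &= y[n-1] + \alpha\,\bigl(x[n] - y[n-1]\bigr), \qquad \alpha = 2^{-6} = \tfrac{1}{64} \\
  f_c  &\approx \frac{\alpha\, f_s}{2\pi} \qquad (\alpha \ll 1)
\end{align}
```

Here x[n] is AUD_inR and y[n] is AUD_outR. For instance, f_s = 48 kHz would give a cutoff of roughly 119 Hz and f_s = 96 kHz roughly 239 Hz.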


  7. M4K memory block timing test
    The CycloneII FPGA has dedicated memory blocks. Timing reads/writes seemed to be an issue in some designs, so I decided to write some Verilog to test various clocking methods, write through versus no write through, and pure Verilog (from the Altera HDL style manual) versus Megafunction memory definitions (which generate Verilog). The six combinations that I tried were:
    1. Memory clocked on negative edge, with write through
    2. Memory clocked on negative edge, without write through
    3. Memory clocked on positive edge, with write through
    4. Memory clocked on positive edge, without write through
    5. Memory block megafunction without registered output, clocked on negative edge
    6. Memory block megafunction without registered output, clocked on positive edge
    The first four were from the Altera HDL style manual, the last two from the MegaWizard Plugin Manager Verilog generator. In all cases the state machine which reads/writes memory is clocked on the positive edge of the same clock. When changing versions of QuartusII, you may need to open the altsyncRAM megafunction (using the appropriate wizard) and regenerate it. The top level module latches the data from memory one cycle after the read/write address is asserted and displays a pass/fail on two LEDs. LEDR[15] lights if read-after-write is correct in one cycle, while LEDR[17] lights if the read is correct in one cycle. Only configurations 1 and 5 pass both tests. The whole project is zipped here.
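The difference between "with write through" and "without write through" can be sketched in plain Verilog as below. This follows a common inferred-RAM idiom and is not the code from the Altera manual or from the project. The 256x16 size, the parameter, and all names are assumptions, and only the positive-edge case is shown.

```verilog
// Hypothetical single-port RAM showing both read-during-write behaviours.
module m4k_style_ram #(
    parameter WRITE_THROUGH = 1   // 1: read returns the new data, 0: the old data
) (
    input  wire        clk,
    input  wire        we,
    input  wire [7:0]  addr,
    input  wire [15:0] d,
    output wire [15:0] q
);
    reg [15:0] mem [0:255];
    reg [7:0]  addr_reg;          // registered address (write-through read path)
    reg [15:0] q_reg;             // registered data (old-data read path)

    always @(posedge clk) begin
        if (we)
            mem[addr] <= d;
        addr_reg <= addr;         // reading mem[addr_reg] later sees the new word
        q_reg    <= mem[addr];    // sampled before the write lands: old word
    end

    assign q = WRITE_THROUGH ? mem[addr_reg] : q_reg;
endmodule
```

Passing an inverted clock to the clk port would give the negative-edge variants. With a write to some address on a clock edge, the write-through version shows the newly written word on q after that edge, while the other version shows the previous contents of that address.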

References

JO Hamblen, TS Hall and MD Furman, Rapid prototyping of digital systems, Springer 2005

JO Hamblen, TS Hall and MD Furman, Rapid prototyping of digital systems: SOPC edition, Springer 2008