Assignment #1: CUDA Programming
Assignment #2: A Simple Filter
Assignment #3: Finding the Maximum
Project Information

Assignment #1: CUDA Programming

Write a CUDA program, which should be named “arradd”, that adds a number X to all elements of a one-dimensional array A. The elements of A and X should be single precision floating-point numbers. Using the CUT timer calls, have your program report the time needed to copy data from the CPU to the GPU, the time needed to add X to all elements of A in the GPU, and the time needed to copy the data back from the GPU to the CPU. The elements of A should be initialized so that A[i] = i / 3.0f.

Have your program vary the number of elements in A from 1M to the maximum number that can be supported by a single invocation of a GPU kernel, in power-of-two steps, i.e., 1M, 2M, 4M, 8M, 16M, etc. For every different array size, have your program print three time measurements: the time required to copy A from the CPU to the GPU, the time taken by the kernel, and the time required to copy the data from the GPU to the CPU. Use the CUT timer calls we reviewed during the lectures. The call “cutGetTimerValue()” returns milliseconds as a single-precision floating-point number. The output of your program should look like:

Elements(M) CPUtoGPU(ms) Kernel(ms) GPUtoCPU(ms)

1 0.000000 0.000000 0.000000

Check the file “NVIDIA_CUDA_SDK/common/src/cutil.cpp” for a description of the CUT timer calls.

Then extend your kernel so that it accepts an extra argument that specifies how many times X should be added to each element. Do not use multiplication for these additions. Rather create a loop.
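
A CPU-side reference can help check what the kernel returns. The following is a minimal sketch, not part of the assignment hand-out: the function names make_input and add_repeated_reference are hypothetical, and the code only mirrors the rules stated above (A[i] = i / 3.0f, and repeated addition in a loop instead of a multiplication).

```cpp
#include <cstddef>
#include <vector>

// Fill A so that A[i] = i / 3.0f, as the assignment specifies.
std::vector<float> make_input(std::size_t n) {
    std::vector<float> a(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i) / 3.0f;
    }
    return a;
}

// Hypothetical CPU reference: add x to every element `times` times,
// using a loop of additions rather than a single multiplication.
void add_repeated_reference(std::vector<float>& a, float x, int times) {
    for (float& v : a) {
        for (int t = 0; t < times; ++t) {
            v += x;
        }
    }
}
```

Because the additions are performed one at a time in single precision, a kernel that loops in the same order should produce matching values; a kernel that multiplies instead may differ in the last bits.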

For the maximum number of elements that can be supported by a single kernel invocation have your program print out the three time measurements above as a function of the number of times X is added. Do so, for a range of 1 through 256 in power of two steps. Your program’s output should look as follows:

XaddedTimes Elements(M) CPUtoGPU(ms) Kernel(ms) GPUtoCPU(ms)

1 x 0.000000 0.000000 0.000000
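
The power-of-two sweep and the row layout above could be produced along these lines. This is a sketch under assumptions: power_of_two_steps and format_row are hypothetical helper names, and the three time values are expected to come from your own CUT timer measurements.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Values first, 2*first, 4*first, ... up to and including last.
// Assumes first > 0, otherwise the loop would never advance.
std::vector<int> power_of_two_steps(int first, int last) {
    std::vector<int> steps;
    for (int v = first; v <= last; v *= 2) {
        steps.push_back(v);
    }
    return steps;
}

// Format one row of the second table:
// XaddedTimes Elements(M) CPUtoGPU(ms) Kernel(ms) GPUtoCPU(ms)
std::string format_row(int times, int elements_m,
                       float to_gpu_ms, float kernel_ms, float to_cpu_ms) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "%d %d %f %f %f",
                  times, elements_m, to_gpu_ms, kernel_ms, to_cpu_ms);
    return std::string(buf);
}
```

The "%f" conversion prints six digits after the decimal point by default, which matches the 0.000000 layout of the sample rows.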

What to hand in:

Submit the version that prints both measurements (i.e., time as a function of element count and time as a function of the number of additions). Put all your code into a single CUDA file. Name the CUDA file hw1.cu. Make sure it compiles and runs correctly. Submit your code through the Blackboard system. More information on the last step will be posted as soon as we figure out how to do this on Blackboard.

Have your program print your name at the beginning as follows:

FIRSTNAME:

LASTNAME:

Feel free to post questions on the course forum.

Assignment #2: A Simple Filter

The code provided contains typos that were pointed out in the discussion forum. Please check the forum and do not “trust” the code that I provided. My goal was to provide you with the skeleton for your application and to take away the complexity of developing the code that uses OpenGL to display the image.

In this assignment you will have to develop a simple blur filter for monochrome images. The image will be provided as a two-dimensional array of bytes. Every byte corresponds to a single pixel and takes values from 0 to 255. To get the blur effect, the value of each output pixel will be determined by the corresponding input pixel value plus all neighboring pixel values as follows. Say we want to calculate the output value for the red pixel (the center one, with value 50) in this 3×3 subsection of the image (shown are sample pixel values).

10 40 70
20 50 80
30 60 90

First we multiply each pixel with the corresponding element of a filter weight table:

1.0 2.0 1.0
2.0 3.0 2.0
1.0 2.0 1.0

And we get this intermediate result:

10  80  70
40 150 160
30 120  90

Then we sum all elements of this intermediate table:

10+80+…+90 = 750

And finally we divide the previous result by the sum of all the weights in the blur filter (15 in our case): 750 / 15 = 50. (Modified on Jan. 29, as per the suggestion of Tijmen posted on the discussion board.)

This becomes the new red pixel value.
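
The arithmetic of the worked example can be checked with a few lines of host code. This is only an illustrative sketch: blur_3x3 and kWeights are hypothetical names, and truncating the quotient to an integer is an assumption, since the text does not say how to round.

```cpp
// 3x3 filter weights from the worked example; they sum to 15.
static const float kWeights[3][3] = {
    {1.0f, 2.0f, 1.0f},
    {2.0f, 3.0f, 2.0f},
    {1.0f, 2.0f, 1.0f},
};

// Weighted average of one 3x3 neighbourhood n, centred on n[1][1].
// The result is truncated toward zero (an assumption, see lead-in).
unsigned char blur_3x3(const unsigned char n[3][3]) {
    float sum = 0.0f;   // sum of pixel * weight
    float wsum = 0.0f;  // sum of the weights
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            sum += static_cast<float>(n[r][c]) * kWeights[r][c];
            wsum += kWeights[r][c];
        }
    }
    return static_cast<unsigned char>(sum / wsum);
}
```

For the sample values above the weighted sum is 750 and the weights add up to 15, so the function returns 50.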

How to develop this program

We are providing you with two source files (press the title link “Assignment 2” and you will see the files):

imagefilter.cu, and imagefilter_kernel.cu

The first file reads the input image (use ”-file FILENAME” – defaults to “bird.pgm”) and also displays it on your screen (you’ll have to do this in the lab or on a machine that is CUDA capable).

Pressing ‘1’ inside the display window toggles the filter.

The filter itself is implemented in the second CUDA file. You’ll have to modify the function d_filter (uchar *d_output, uchar *d_input, uint width, uint height). In d_output you should write the filtered image. The input image is in d_input. The dimensions of the image are given by the other two parameters. These are two dimensional arrays. So element (x,y) is at distance y * width + x from the base of the array. The code as provided multiplies each pixel by 2. Change that portion to implement the blur filter.
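
The row-major addressing described above can be sketched as follows. The helper names pixel_offset and read_clamped are hypothetical, and clamping coordinates at the image edges is an assumption: the assignment text does not say how pixels on the border, which lack some neighbors, should be treated.

```cpp
// Offset of pixel (x, y) from the base of a row-major image.
inline unsigned int pixel_offset(unsigned int x, unsigned int y,
                                 unsigned int width) {
    return y * width + x;
}

// Read pixel (x, y), clamping out-of-range coordinates to the nearest edge.
// Border handling is an assumption; the assignment does not specify it.
unsigned char read_clamped(const unsigned char* img, int x, int y,
                           int width, int height) {
    if (x < 0) x = 0;
    if (x >= width) x = width - 1;
    if (y < 0) y = 0;
    if (y >= height) y = height - 1;
    return img[pixel_offset(static_cast<unsigned int>(x),
                            static_cast<unsigned int>(y),
                            static_cast<unsigned int>(width))];
}
```

The same offset expression works unchanged inside a kernel, where x and y would typically be derived from the block and thread indices.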

You may also choose to modify the function execconfig_setup (void) that sets up the grid and block dimensions for subsequent invocations of the kernel d_filter. The two variables of interest are blockSize and gridSize.

Both the kernel and the execution configuration setup function are called from function render which is in imagefilter.cu.

What to hand in:

Modify the kernel file so that it implements the filter. Use the display version of imagefilter.cu to develop and test your code. Then switch to the non-display version of imagefilter.cu to measure the running time of your kernel. In this version of imagefilter.cu you will have to develop a CPU-side version of the filter and measure its running time. We provided templates for the host_filter () and pixelfilter () functions. Fill in the details. Also add code to compare the results produced by the GPU and the CPU. The output arrays are d_oimage and h_hoimage for the GPU and CPU respectively. The input arrays are d_image and h_image respectively as well.
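
One possible way to compare the two outputs is to count the pixels that disagree. This is a sketch under assumptions: count_mismatches is a hypothetical helper, it expects the GPU output to have already been copied back into host memory, and the tolerance parameter is an addition of this example to absorb rounding differences between the two implementations.

```cpp
#include <cstddef>
#include <cstdlib>

// Count pixels whose GPU and CPU values differ by more than `tolerance`.
// Both pointers must refer to host memory holding n pixels each.
std::size_t count_mismatches(const unsigned char* gpu,
                             const unsigned char* cpu,
                             std::size_t n, int tolerance) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int diff = std::abs(static_cast<int>(gpu[i]) - static_cast<int>(cpu[i]));
        if (diff > tolerance) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With a tolerance of 0 the check demands bit-identical results; a return value of 0 means the two images agree within the chosen tolerance.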

You will have to hand in the modified imagefilter.cu (non-display version) file and the modified imagefilter_kernel.cu.

Have your program print your name at the beginning as follows:

FIRSTNAME:

LASTNAME:

Feel free to post questions on the course forum.

http://www.eecg.toronto.edu/~moshovos/CUDA08/assignments/imagefilter.cu: This version displays the image on your screen and allows you to see what your algorithm produces. Use it to develop and debug your code.
http://www.eecg.toronto.edu/~moshovos/CUDA08/assignments/imagefilter2.cu: This version does not display the image on your screen. Use it to measure running time and to validate the results with a CPU-side implementation. This contains typos as well (e.g., parameters passed to render are in the wrong order, copy to host uses the wrong GPU variable, etc.). Please check the discussion forum.
http://www.eecg.toronto.edu/~moshovos/CUDA08/assignments/Makefile
http://www.eecg.toronto.edu/~moshovos/CUDA08/assignments/imagefilter_kernel.cu

Assignment #3: Finding the Maximum

Write a CUDA program that finds the maximum among the elements of an array of N integers. Try to optimize your program for speed as much as possible.

Your program should first allocate an array of N integers and then assign random values to them. Have your program vary N to be 2M, 8M, and 32M (M=2^20).

For each N, find the maximum using the GPU and then do so using the CPU. Your program should report the average execution time per N for the GPU and the CPU. The execution time should exclude the overheads of memory allocation and of copying data between the GPU and the CPU. Average execution times over 10K runs (your program should do that internally – do not run the program 10K times).
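
The CPU side of this measurement might look like the sketch below. The names cpu_max and average_cpu_seconds are hypothetical, and std::chrono::steady_clock is used here only as one portable timer; the GPU side would need its own timing around the kernel launches.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Maximum of a non-empty array, found with a plain linear scan.
int cpu_max(const std::vector<int>& v) {
    int best = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] > best) best = v[i];
    }
    return best;
}

// Run cpu_max `runs` times inside one process and return the average
// time per run in seconds. The maximum is written to *result.
double average_cpu_seconds(const std::vector<int>& v, int runs, int* result) {
    using clock = std::chrono::steady_clock;
    volatile int sink = 0;  // keeps the compiler from dropping the loop
    const clock::time_point start = clock::now();
    for (int r = 0; r < runs; ++r) {
        sink = cpu_max(v);
    }
    const clock::time_point stop = clock::now();
    *result = sink;
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

Allocation and random initialization happen before the timed region, in line with the requirement to exclude those overheads.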

The output of your program should look like this:

FIRSTNAME: YOURFIRSTNAME

LASTNAME: YOURLASTNAME

N 2M GPUmax: # CPUmax: # GPUtime: # CPUtime: # GPUSpeedup: #

N 8M GPUmax: # CPUmax: # GPUtime: # CPUtime: # GPUSpeedup: #

…

Where # are numbers. Report time in SECONDS using a 0.000000 format (six digits after the dot). GPUSpeedup is CPUtime/GPUtime.
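
A result line in this layout could be built as in the sketch below; format_result is a hypothetical helper and its inputs are assumed to be your own measured values in seconds.

```cpp
#include <cstdio>
#include <string>

// Build one result line. Times are in seconds; "%f" prints six digits
// after the dot. GPUSpeedup is computed as cpu_s / gpu_s.
std::string format_result(int n_m, int gpu_max, int cpu_max_value,
                          double gpu_s, double cpu_s) {
    char buf[256];
    std::snprintf(buf, sizeof buf,
                  "N %dM GPUmax: %d CPUmax: %d GPUtime: %f CPUtime: %f GPUSpeedup: %f",
                  n_m, gpu_max, cpu_max_value, gpu_s, cpu_s, cpu_s / gpu_s);
    return std::string(buf);
}
```

For example, a GPU time of 0.5 s and a CPU time of 1.0 s give a GPUSpeedup of 2.000000.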

Recall that threads from different blocks cannot coordinate. First develop an algorithm that uses just a single block and for N up to 512. Then think about how you could use multiple blocks. You are free to use multiple kernels.
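
One common starting point for the single-block case is a pairwise (tree) reduction. The assignment does not prescribe this technique, so the following is only a CPU simulation of the idea, with the hypothetical name tree_max: each pass of the inner loop stands for what the threads of one block would do in parallel, and the end of each pass stands for a barrier between rounds.

```cpp
#include <cstddef>
#include <vector>

// CPU simulation of a pairwise max reduction over a non-empty array.
// `data` is taken by value because it is used as scratch space.
int tree_max(std::vector<int> data) {
    std::size_t n = data.size();  // number of live candidates
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;
        // Every iteration is independent of the others in this round,
        // so on a GPU each one could be handled by a separate thread.
        for (std::size_t i = 0; i + half < n; ++i) {
            if (data[i + half] > data[i]) {
                data[i] = data[i + half];
            }
        }
        n = half;  // a real kernel would synchronize the block here
    }
    return data[0];
}
```

For 512 elements the loop finishes in 9 rounds. Extending the idea to several blocks would leave one partial maximum per block, which then has to be combined in a further step, for instance a second kernel.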

What to hand in:

Put all your code into a single CUDA file. Name the CUDA file hw3.cu. Make sure it compiles and runs correctly. Write a very short description of your algorithm. You are free to use up to 500 words for your description. Please use plain text for this write-up (no .doc, .html, .pdf, etc.).

Put your hw3.cu and your hw3.txt file into a YOURLASTNAME.zip file and submit that through the Blackboard system.

Feel free to post questions on the course forum. If you don’t get a response within a reasonable amount of time (you decide what that is), please e-mail me or Hassan directly.

Project Information

As part of this course you are required to do a programming project. The goal is to implement an interesting algorithm in CUDA and optimize it as much as you can for performance. Ideally the algorithm will be solving an interesting research or practical problem. The ideal team will have four members where one of the team members is directly interested in using the CUDA implemented algorithm in their own research or work. The other team members will ideally be from different backgrounds and be experienced C/C++ programmers. Smaller teams are possible but are strongly discouraged. Larger teams will not be accepted unless, of course, you can offer a strong argument why having a larger team is necessary.

You will be evaluated on how well you describe the problem you are trying to solve using CUDA and how well you document the progress you made in the CUDA implementation. Ultimately, the most important aspect of the evaluation is how well you document what you did, why you did it, what the outcome was, and what the next possible steps would be.

Timeframe

There are five milestones. First, you are encouraged to meet with me and discuss your plans. This is not mandatory but it is strongly encouraged. Second, you should submit a short description of the proposed project. Mid-way through your project you will have to submit a written update on your project’s progress. At the end of the course you will have to give a 10-minute presentation. Finally, you have to submit a final project report. Here are the dates you should remember:

          Friday, March 6th        Starting at 2pm in EA311
                                   Meet with me and discuss your plans

          Monday, March 9th        Project Proposal due on-line by 23:59 EST
                                   NO LATE PROPOSALS WILL BE ACCEPTED.

          Monday, March 23rd       Project Progress report due on-line by 23:59 EST
                                   NO LATE REPORTS WILL BE ACCEPTED.

          Week of April 13th       Project Presentations

          Thursday, April 16th     Final Project report due on-line by 11:59am EST
                                   NO LATE REPORTS WILL BE ACCEPTED.

PROPOSAL FORMAT

The proposal should be at most 1000 words, plus references as needed. Submit an HTML file (hint: Microsoft Word and most WYSIWYG editors can produce HTML files).

Please explain in the following order:

1. Introduction/Motivation: This should be written so that someone that is not necessarily in your field can appreciate what the applications are and why this is an important problem.

2. Topic, i.e., what are you going to do.

3. Methodology: How will you measure performance? How will you measure numerical accuracy? In general, try to address the question “how would one know how well your program works”.

4. Goal: What you promise to deliver at the end. Feel free to say what else you might be able to do, if for whatever reason you manage to reach the goals earlier than expected.

Remember: while these may evolve or change you are required to start with a meaningful plan. No point starting something if you cannot articulate why this might be interesting or doable.

Do not feel obliged to submit close to 1000 words. Conciseness will be greatly appreciated.

If you prefer that we do not post your code and presentation, please state so. Feel free to use this course to develop code that you can use directly in your work/research.

PROGRESS REPORT FORMAT

500 words or less. What you have done so far. Difficulties faced. Any changes in your plan.

FINAL PROJECT REPORT

Try to limit this to at most 4000 words. (Remember: it’s easy to write a lot of text. It’s hard to write concisely.) An approximate format is:

1. Introduction: Motivation and problem statement. BRIEFLY. Conclude by previewing your method and your most important results.

2. Expanded motivation if needed.

3. Related Work. It is your responsibility to find and report any related work.

4. Explanation of the algorithm and then of its implementation.

5. Methodology. A summary of your metrics and WHY you are using them, plus a description of the experiments.

6. Evaluation. The results, one by one.

7. Conclusions. “I am so good I cannot stand myself. Maybe I should donate my brain to science” put in less obvious terms. Future directions (not left or right).

lab_assignments.txt · Last modified: 2009/05/04 21:22 by yperxristis