Project CS3103 Operating Systems Version 1.0
1 CS3103 – Operating Systems
Project: Parallel Zip
I. Project Instructions
Overview
In the earlier programming assignment, you implemented a simple compression tool based on run-length
encoding, known simply as czip. For this project, you’ll implement something similar, except you’ll use threads
to make a parallel version of zip. We’ll call this version pzip.
There are three specific objectives to this project:
• To familiarize yourself with the Linux pthreads package for threading.
• To learn how to parallelize a program.
• To learn how to program for performance.
Due Date
Tuesday, November 26, 2019 at 11:59 PM. Late submissions will be penalized as per syllabus.
Handing In
The project can be done individually, in pairs, or in groups, where each group can have a maximum of three
members. All students are required to join one project group in Canvas (“People” section > “Groups” tab >
Project Group). Self sign-up is enabled for these groups. Instead of all students having to submit a solution to
the project, Canvas allows one person from each group to submit on behalf of their team. Please be sure to
indicate who is in your group when submitting the project report.
Before you hand in, make sure to add the requested identifying information about your project group, which
contains project group number, full name and e-mail address for each group member.
When you’re ready to hand in your solution, go to the course site in Canvas, choose the “Assignments” section >
“Project” group > “Project” item > “Submit Assignment” button and upload your files, including the following:
1) A Microsoft Word or Adobe PDF document which concisely describes your design,
implementation, and experimental results;
2) The source file, i.e., pzip.c;
In your project report, please describe the main techniques or mechanisms proposed to parallelize the
compression, list and describe all the functions used in this project, summary and analyze the results. You
should also discuss any difficulties encountered, what was learned. If you are working in a team, describe each
team member’s contribution.
2 CS3103 – Operating Systems
Your code should be nicely formatted with plenty of comments. Each function should be preceded by a
header comment that describes what the function does. The code should be easy to read, properly indented,
employ good naming standards, good structure, and should correctly implement the design. Your code
should match your design.
Academic Honesty
All work must be developed by each group separately. Please write your own code. All submitted source
code will be scanned by anti-plagiarism software. If the code does not work, please indicate in the report
clearly.
Questions?
If you have questions, please first post them on Canvas so others can get the benefit of the TA’s answer.
Avoid posting code that will give away your solution or allow others to cheat. If this does not resolve your
issue, contact the TA (Mr. CHEN Peilin <[email protected]>), or come to office hours.
Acknowledgements
This project is taken (and modified) from the OSTEP book written by Remzi and Andrea Arpaci-Dusseau at
the University of Wisconsin. This free OS book is available at http://www.ostep.org. Automated testing scripts
are from Kyle C. Hale at the Illinois Institute of Technology.
Disclaimer
The information in this document is subject to change with notice via Canvas. Be sure to download the latest
version from Canvas.
3 CS3103 – Operating Systems
II. Project Description
For this project, you will implement a parallel version of zip using threads. First, recall how zip works by
reading the description in Assignment 1 (Part II). You’ll use the same basic specification, with run-length
encoding as the basic technique.
Your pzip will externally look the same as czip. However, internally, the program will use POSIX threads
to parallelize the compression process. The general usage from the command line will be as follows:
$ ./pzip file.txt > file.z
Doing so effectively and with high performance will require you to address (at least) the following issues:
• How to parallelize the compression. The central challenge of this project is to parallelize the
compression process. Think about what can be done in parallel, and what must be done serially by a
single thread, and design your parallel zip as appropriate. For example, does it possible to zip several
small files using multiple threads instead of a large file using only one thread (czip)? If it’s possible,
how to divide the large file and merge the zip result of several small files? One interesting issue that
the “best” implementations will handle is this: what happens if one thread runs more slowly than
another? Does the compression give more work to faster threads? This issue is often referred to as
the straggler problem.
• How to determine how many threads to create. On Linux, this means using interfaces like
get_nprocs() and get_nprocs_conf(); read the man pages for more details. Then, create an
appropriate number of threads to match the number of CPUs available on whichever system your
program is running.
• How to efficiently perform each piece of work. While parallelization will yield speed up, each
thread’s efficiency in performing the compression is also of critical importance. Thus, making the
core compression loop as CPU efficient as possible is needed for high performance.
• How to access the input file efficiently. On Linux, there are many ways to read from a file, including
C standard library calls like fread() and raw system calls like read(). One particularly efficient
way is to use memory-mapped files, available via mmap(). By mapping the input file into the address
space, you can then access bytes of the input file via pointers and do so quite efficiently.
To understand how to make tackle these problems, you should first understand the basics of thread creation,
and perhaps locking and signaling via mutex locks and condition variables. Review the tutorials and read the
following chapters from OSTEP book carefully in order to prepare yourself for this project.
• Intro to Threads
• Threads API
• Locks
• Using Locks
• Condition Variables
4 CS3103 – Operating Systems
III. Project Guidelines
1. Getting the Starter Code
The project is to be done on the CSLab SSH gateway server, to which you should already be able to log in.
As before, follow the same copy procedure as you did in the previous tutorials to get the starter code. The start
code is available at /public/cs3103/project on the gateway server. It contains the following files/directories:
/public/cs3103/project
├── Makefile
├── pzip.c ├── README.txt └── tests ├── bin | <- modify and hand in pzip.c file |
│ │ │ | ├── generic-tester.py ├── serialized_runner.py └── test-pzip.csh |
└── tests-pzip
├── 1.binary
├── 1.desc
├── 1.err
├── 1.in
├── 1.out
├── 1.prep
├── 1.rc
├── 1.run
├── …
└── 9.timeout
3 directories, 81 files
Start by copying the provided files to a directory in which you plan to do your work. For example, to copy
files in /public/cs3103/project directory and its contents to a new pzip directory, change to it, and take a look
at the directory contents, enter:
$ cp -rf /public/cs3103/project pzip
$ cd pzip
$ ls
5 CS3103 – Operating Systems
2. Writing your pzip program
The pzip.c is the file that you will be handing in and is the only file you should modify. Write your code from
scratch or simply borrow the code from your czip to implement this parallel version of zip. Again, it’s a good
chance to learn (as a side effect) how to use a proper code editor such as vim12.
NOTES: CS3103 has some pre-cursors, including computer programming in C++ or Java, which equip
students with essential programming skills. So it should be fine if you have learned about these courses. Our
previous programming assignment is meant to get you warmed up with programming in the C/Linux
environment. Realize the best thing you can do to learn to program in any environment is to program a lot.
3. Building your program
A simple makefile that describes how to compile pzip is provided for you. To compile your pzip.c and to
generate the executable file, use the make command within the directory that contains your project. It will
display the command used to compile the pzip.
$ make
gcc -Wall -Werror -pthread -O pzip.c -o pzip
Note that the -Werror compiler flag is specified. It causes all warnings to be treated as build errors. It would
be better to fix the compiling issue instead of disabling -Werror flag.
If everything goes well, there would an executable file pzip in it:
$ ls
Makefile pzip pzip.c README.txt tests
If you make some changes in pzip.c later, you should re-compile the project by running make command again.
To remove any files generated by the last make, use the make clean command.
$ make clean
rm -f pzip
$ ls
Makefile pzip.c README.txt tests
So you can see that the executable ‘pzip’ was deleted.
4. Testing your C program
We also provide 9 test cases for you to test your code. You can find them in the directory tests/tests-pzip/.
1 Interactive Vim tutorial, https://www.openvim.com/
2 Vim Cheat Sheet, https://vim.rtorr.com
6 CS3103 – Operating Systems
• Test case 1-5: Test cases 1, 2, 4, and 5 are taken from czip/tests in Assigment 1 without any
modification. For test case 3, if no files are specified, the program should exit with return code 1 and
print “pzip: file1 [file2 …]” (followed by a newline).
Test case 1) Test case 2) Test case 3) Test case 4) Test case 5) | basic test – some ‘a’ characters multiple files on command line no files (error) multi-line file with some longer lines does compression always compress? |
• Test case 6-9: Four large input files with some compressability.
Test case 6) Test case 7) Test case 8) Test case 9) | 100 MB file (2 sec timeout) 150MB files (2 sec timeout) 200MB files (2 sec timeout) 250MB files (2 sec timeout) |
Each test consists of 7~9 files with different filename extension:
• n.binary: Indicates the data type of input and output is binary
• n.rc: The return code the program should return (usually 0 or 1)
• n.out: The standard output expected from the test
• n.in: The input file of the test case
• n.err: The standard error expected from the test
• n.run: How to run the test (which arguments it needs, etc.)
• n.desc: A short text description of the test
• n.prep: Code to run before the test, to set something up
• n.timeout: Time limitation(seconds)
Just like in Assignment 1, you can run and test your pzip manually. For example, to run your pzip to compress
the input file 1.in in tests/tests-pzip and save the compressed file as 1.out, enter:
$ ./pzip ./tests/tests-pzip/1.in > 1.out
For other test cases, run as follows:
$ ./pzip tests/tests-pzip/1.in tests/tests-pzip/1.in tests/tests-pzip/1.in > 2.out
$ ./pzip > 3.out
$ ./pzip tests/tests-pzip/4.in > 4.out
$ ./pzip tests/tests-pzip/5.in > 5.out
$ ./pzip tests/tests-pzip/6.in > 6.out
$ ./pzip tests/tests-pzip/7.in > 7.out
$ ./pzip tests/tests-pzip/8.in > 8.out
$ ./pzip tests/tests-pzip/9.in > 9.out
7 CS3103 – Operating Systems
The makefile could also trigger automated testing scripts, type:
$ make test
TEST 0 – clean build (program should compile without errors or warnings)
Test finished in 0.054 seconds
RESULT passed
TEST 1 – basic test – some ‘a’ characters
Test finished in 0.002 seconds
RESULT passed
TEST 2 – multiple files on command line
Test finished in 0.002 seconds
RESULT passed
TEST 3 – no files (error)
Test finished in 0.001 seconds
RESULT passed
……
TEST 6 – 100 MB file | (2 sec timeout) |
WARNING test timed out | (i.e., it took too long) |
Test finished in 2.024 seconds | |
RESULT failed | |
return code does not match expected | |
expected: got: | [0] [-15] |
standard output does not match expected
Showing first 500 bytes; use -vv to show full output
……
The job of those automated scripts is to orderly run your pzip on the test cases and check if your pzip zips
input files correctly. TEST 0 will fail if your program is compiled with errors or warnings. Time limitation is
set for each test case, and if this limit is exceeded, your test will fail. Besides, the script will record the running
time of your program for performance evaluation. Lower time/higher performance will lead to better scores.