Homework and Projects
| Homework | Due | Covers | Worth |
|---|---|---|---|
| Assignment 1 | 9/25/2017 | NoSQL choice | 5% |
| Assignment 2 | 10/4/2017 | MapReduce | 5% |
| Assignment 3 | 11/30/2017 | Sentence distances | 20% |
| Project | | Class projects | ??% |
| Class participation | | | 10% |
Advice and Hints
All homework should be uploaded to WyoCourses.
You are free to program in C, C++, Java, Python, Fortran, Lex/Yacc, or anything else that is appropriate. Talk to me about NoSQLs and other database tools first. You may use almost anything if you confirm it with me in advance.
If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean that I write the best code possible.
You should upload a single file (e.g., in zip format) containing the following:
- A PDF file describing the homework solution and what you are turning in.
- All codes.
- A description of the codes in a file (e.g., PDF, RTF, Word, or LaTeX).
- How to make the code (this could be as simple as saying, "run the make command" with or without some argument).
- How to run the code (if possible, have a make run option in a Makefile).
If I cannot recreate your solution on my own computer, you will hear from me.
Investigate non-relational databases that are available on the web for free. Choose one that you want the entire class to use. Be prepared on Tuesday, September 26 to announce your choice and be able to defend your decision verbally.
Turn in a list of which NoSQLs you investigated and which one is your top choice. Provide a brief justification for your choice.
Choose a MapReduce system for which you are capable of programming the map and reduce functions. Download the key-value data. There are 1000 lines of data. The keys are numbers and the values are random characters. There is a single blank character between a key and its value.
Your task is to sort the keys in ascending order and write out the sorted keys with their associated data using MapReduce.
Turn in a report that details which MapReduce system you used and your sorting strategy. How many map and reduce functions did you run with and why? Also turn in your map and reduce functions.
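To make the sorting task concrete, here is a minimal sketch of one possible strategy, simulated in plain Python rather than on any particular MapReduce system. It assumes the keys are integers with a known upper bound; the names `map_line`, `reduce_partition`, `mapreduce_sort`, and the parameters `num_reducers` and `max_key` are hypothetical and chosen only for illustration. The map function range-partitions each key, each reducer sorts its own partition, and concatenating the reducer outputs in partition order then gives a globally sorted result.

```python
from collections import defaultdict


def map_line(line, num_reducers, max_key):
    """Map: parse 'key value' and emit (partition, (key, value))."""
    key_text, value = line.split(" ", 1)  # single blank separates key and value
    key = int(key_text)                   # assumption: keys are integers
    # Range partitioning: low keys go to low-numbered reducers.
    # The clamp keeps out-of-range keys in the first or last partition.
    partition = key * num_reducers // (max_key + 1)
    partition = max(0, min(partition, num_reducers - 1))
    return partition, (key, value)


def reduce_partition(pairs):
    """Reduce: sort the pairs of one partition by numeric key."""
    return sorted(pairs, key=lambda kv: kv[0])


def mapreduce_sort(lines, num_reducers=4, max_key=10**6):
    """Simulate map, shuffle, and reduce in one process."""
    # Shuffle phase: group mapper output by partition number.
    buckets = defaultdict(list)
    for line in lines:
        partition, pair = map_line(line.rstrip("\n"), num_reducers, max_key)
        buckets[partition].append(pair)
    # Partitions cover increasing key ranges, so concatenating the
    # reducer outputs in partition order is globally sorted.
    result = []
    for partition in range(num_reducers):
        result.extend(reduce_partition(buckets[partition]))
    return [f"{key} {value}" for key, value in result]
```

With only 1000 lines a single reducer would also work; the range partitioning matters only when the reduce work is spread over several reducers.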
Do the Big Sentences problem for 0-, 1-, and k-distances, where k > 1. There are other problems at that web site, but only the sentences problem concerns you. Your goals include the following:
- Quickly and efficiently produce a smaller file that contains no duplicate lines and no lines that differ from another line by only one add/delete of a single word. Pairwise line comparison is too expensive since it takes O(n²/2) comparisons for n lines. Big data similarity/identity finding techniques must be employed for a near linear time solution.
- You want to live long enough to see the results when 25,000,000 lines are involved.
- Start working on this problem soon enough to finish and not see O(n²/2) run times in the table below. Downloading the files takes time, too. Start early and work on it often to make improvements.
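For scale, 25,000,000 lines compared pairwise would need about 25,000,000² / 2 ≈ 3.1 × 10^14 comparisons. The following is a minimal sketch of one hashing-based alternative for distances 0 and 1, not the required method. It assumes a line is a sequence of whitespace-separated words and that distance 1 means one line can be obtained from another by adding or deleting a single word; the function name `dedupe` is hypothetical. Each line is looked up in hash sets instead of being compared against every other line, so the work grows roughly linearly with the number of lines.

```python
def dedupe(lines, distance=0):
    """Keep the first line of every group of lines within `distance` (0 or 1)."""
    kept = []
    full_seen = set()     # word tuples of the lines kept so far
    variant_seen = set()  # kept lines with exactly one word removed
    for line in lines:
        words = tuple(line.split())
        if words in full_seen:
            continue  # distance 0: same words as a kept line
        if distance >= 1:
            # Every way of deleting one word from this line.
            variants = {words[:i] + words[i + 1:] for i in range(len(words))}
            # Drop the line if it is a kept line minus one word, or if
            # deleting one of its words yields a kept line.
            if words in variant_seen or variants & full_seen:
                continue
            variant_seen |= variants
        full_seen.add(words)
        kept.append(line)
    return kept
```

Storing whole word tuples is memory hungry; for the largest input files a real run would more likely store fixed-size hashes of the tuples and accept a small collision risk.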
Turn in all of your files in a zip or gzipped tar file. Include a PDF file that describes what you did, how you did it, and the results. Make certain there are clear instructions on how to duplicate your results in case I want to do so, including detailed instructions on how to compile and run your code. Include source code, MapReduce functions, hash functions, and anything else relevant, but do not send me the input files.
Complete the following table (replace k > 1 by an integer):
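| Input file | Distance 0: wallclock time | Distance 0: output lines | Distance 1: wallclock time | Distance 1: output lines | Distance k > 1: wallclock time | Distance k > 1: output lines |
|---|---|---|---|---|---|---|
| 100.txt | | | | | | |
| 1K.txt | | | | | | |
| 10K.txt | | | | | | |
| 100K.txt | | | | | | |
| 1M.txt | | | | | | |
| 5M.txt | | | | | | |
| 25M.txt | | | | | | |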
Finally, plot the wallclock times versus the size of the data set.
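As a possible aid for collecting the numbers behind the table and the plot, here is a small sketch using only the Python standard library. The helper names `time_run` and `loglog_slope` are hypothetical. The first measures the wallclock time of one run. The second fits a straight line to log(time) versus log(size): a slope near 1 suggests near linear growth, while a slope near 2 suggests the O(n²/2) behavior to avoid.

```python
import math
import time


def time_run(func, *args):
    """Run func(*args) and return (result, elapsed wallclock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

The measured (size, time) pairs can then be handed to whatever plotting tool you prefer.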
The following project topics have been proposed:
- None so far