University of Wyoming MA 5490, 2017 Fall

Tuesday and Thursday, 8:10-9:25, 247 Ross Hall

Office Hours: Tuesday 9:30-10:30, Wednesday 11:00-12:00, and Friday 3:00-4:00

Professor Craig C. Douglas

Homework and Projects

Homework             Due         Covers              Worth
Assignment 1         9/25/2017   NoSQL choice        5%
Assignment 2         10/4/2017   MapReduce           5%
Assignment 3         11/30/2017  Sentence distances  20%
Project                          Class projects      ??%
Class participation                                  10%

Advice and Hints

All homework should be uploaded to WyoCourses.

You are free to program in C, C++, Java, Python, Fortran, Lex/Yacc, or anything else that is appropriate. Talk to me about NoSQLs and other database tools first. You may use almost anything if you confirm it with me in advance.

If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean that I write the best code possible.

You should upload a single file (e.g., in zip format) containing the following:

If I cannot recreate your solution on my own computer, you will hear from me.

Homework

Assignment 1

Investigate non-relational databases that are available on the web for free. Choose one that you want the entire class to use. Be prepared on Tuesday, September 26 to announce your choice and be able to defend your decision verbally.

Turn in a list of which NoSQLs you investigated and which one is your top choice. Provide a brief justification for your choice.

Assignment 2

Choose a MapReduce system for which you are able to program the map and reduce functions. Download the key-value data. There are 1000 lines of data. The keys are numbers and the values are random characters. There is a single blank character between a key and its value.

Your task is to sort the keys in ascending order and write out the sorted keys with their associated data using MapReduce.
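One possible sorting strategy can be sketched in plain Python without any particular MapReduce system. This is an illustration only, not a required approach. It assumes the keys are integers, and the names map_line, range_partition, reduce_bucket, and mapreduce_sort are hypothetical. Each map call splits a line at the first blank, a range partitioner sends smaller keys to lower-numbered reducers, and each reducer sorts its own bucket, so concatenating the reducer outputs in order gives a globally sorted result.

```python
from collections import defaultdict


def map_line(line):
    """Map: split a line at the first blank into (integer key, value)."""
    key_text, _, value = line.rstrip("\n").partition(" ")
    yield int(key_text), value


def range_partition(key, num_reducers, min_key, max_key):
    """Send smaller keys to lower-numbered reducers (assumes integer keys)."""
    span = max_key - min_key + 1
    return (key - min_key) * num_reducers // span


def reduce_bucket(pairs):
    """Reduce: sort one bucket by key and emit 'key value' lines."""
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        yield f"{key} {value}"


def mapreduce_sort(lines, num_reducers=4):
    """Simulate map, shuffle, and reduce in a single process."""
    pairs = [kv for line in lines if line.strip() for kv in map_line(line)]
    if not pairs:
        return []
    min_key = min(k for k, _ in pairs)
    max_key = max(k for k, _ in pairs)
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[range_partition(key, num_reducers, min_key, max_key)].append((key, value))
    output = []
    for r in range(num_reducers):  # reducer order equals key-range order
        output.extend(reduce_bucket(buckets[r]))
    return output
```

In a real MapReduce system the shuffle and the partitioner are usually supplied or configured by the framework, so the number of reducers and the key ranges they receive are the main design choices to justify in the report.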

Turn in a report that details which MapReduce system you used and your sorting strategy. How many map and reduce functions did you run, and why? Also turn in your map and reduce functions.

Assignment 3

Do the Big Sentences problem for 0-, 1-, and k-distances, where k > 1. There are other problems at that web site, but only the sentences problem concerns you. Your goals include the following:

  1. Efficiently and quickly produce a smaller file that contains no duplicate lines and no lines that differ from another line by only one add/delete of a single word. Pairwise line comparison is too expensive, since it takes O(n^2/2) comparisons for n lines. Big data similarity/identity finding techniques must be employed for a near linear time solution.
  2. You want to live long enough to see the results when 25,000,000 lines are involved.
  3. Start working on this problem soon enough to finish and not see O(n^2/2) run times in the table below. Downloading the files takes time, too. Start early and work on it often to make improvements.
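To see why pairwise comparison is hopeless, note that n = 25,000,000 lines gives n^2/2 of about 3.1 x 10^14 comparisons. The sketch below shows one hash-based alternative for distances 0 and 1. It is an illustration under stated assumptions, not the expected solution: words are split on whitespace, distance 1 means exactly one word added or deleted, the shorter line of a near-duplicate pair is the one kept, and the names deletion_variants and dedupe are hypothetical. Every line is looked up in a hash table, and a line is dropped when deleting one of its words produces a line that has already been kept.

```python
def deletion_variants(words):
    """Yield every word tuple obtained by deleting exactly one word."""
    for i in range(len(words)):
        yield words[:i] + words[i + 1:]


def dedupe(lines, max_distance=1):
    """Drop exact duplicates and, if max_distance is 1, one-word neighbors."""
    # Distance 0: a dict keyed by the word tuple is a hash lookup per line.
    # Lines that differ only in whitespace count as identical here.
    unique = {}
    for line in lines:
        unique.setdefault(tuple(line.split()), line.rstrip("\n"))
    if max_distance == 0:
        return list(unique.values())

    # Distance 1: visit shorter lines first, so that a possible one-word-shorter
    # neighbor has already been kept or rejected when a longer line is tested.
    kept = {}
    for words in sorted(unique, key=len):
        if any(variant in kept for variant in deletion_variants(words)):
            continue  # a kept line equals this line minus one word
        kept[words] = unique[words]
    return list(kept.values())
```

Each line with m words costs m hash lookups, so the work grows linearly in the number of lines apart from the sort by length, which could be replaced by bucketing lines by word count. For the full 25M.txt input, storing hashes of the word tuples instead of the tuples themselves would reduce memory, at the price of handling collisions.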

Turn in all of your files in a zip or gzipped tar file. Include a PDF file that describes what you did, how you did it, and the results. Include source code, MapReduce functions, hash functions, and anything else relevant, but do not send me the input files. Include detailed, clear instructions on how to compile and run your code so that I can duplicate your results if I want to do so.

Complete the following table (replace k > 1 by an integer):

Input file  Distance 0                    Distance 1                    Distance k > 1
            Wallclock time  Output lines  Wallclock time  Output lines  Wallclock time  Output lines
100.txt
1K.txt
10K.txt
100K.txt
1M.txt
5M.txt
25M.txt

Finally, plot the wallclock times versus the size of the data set.
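Before plotting, it can help to collect the timings in a uniform way and to check the growth rate numerically. The following is a minimal sketch using only the Python standard library. The helper names time_run and growth_exponent are hypothetical, and the slope is only a rough indicator: it is the least-squares slope of log(time) against log(size), so a value near 1 suggests near linear scaling and a value near 2 suggests O(n^2) behavior.

```python
import math
import time


def time_run(func, *args):
    """Return (result, wall-clock seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def growth_exponent(sizes, seconds):
    """Least-squares slope of log(seconds) versus log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A log-log plot of the same (size, time) pairs shows this slope directly, which makes it easy to compare the three distance columns of the table.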

Projects

The following project topics have been proposed:

 

Cheers,
Craig C. Douglas
