# Common Homework Problems

Over time I have developed some favorite problems for a variety of courses. These include, for example,

• High Performance Computing and eXtreme Technical Computing class
• Big Data class

The problems include

• Big Sentences: Consider any of the text files in the smaller datasets first. Start with tiny.txt, then use small.txt, and progressively work up to the 1M.txt file, which contains 1,000,000 lines of text. The moderate dataset expands to 428 MB (5M.txt). The largest dataset expands to 2.1 GB (25M.txt). All characters are lower case and all words are separated by a single blank. Words may be nonsensical or from an established language. All lines end in a linefeed (UNIX style line end), but some have an extra blank after the last word. The larger files are based on a wonderful and quirky Stanford University file for a course there on data mining that sadly no longer seems to be on their web servers.
  • Filter out the nonunique lines.
  • Filter out the distance 1 (or n > 0) lines. Alternatively, find only the distance 1 (or n) lines. Distance n means that you can transform string x into string y by at most n changes to x, where each change is from {add 1 word, delete 1 word, or swap 2 neighboring words}.
  • For each dataset with at least 100 lines, plot the wall clock time. If you are using an obvious and simple algorithm, you will see a parabola, since your algorithm will be $O(k^2)$ for a file with $k$ lines. Somewhat better will be $O(k \log k)$. Far, far better will be almost $O(k)$.
• Fracking: Consider any of the Excel spreadsheets, either the subset of .csv files or the full set of .xlsx files. You may find it easier to convert the .xlsx files to .csv before working with them. There is software on the Internet to read either format, however. Warning: There are geographical locators in the fracking spreadsheets. If you live in a country in which it is illegal to bring this data in, do not do so. If I get enough requests, then I will put together data without the locations. All of this data is publicly available on the Internet without restrictions. Simple questions include the following:
  • Find all of the nonproducing wells and relate them to each other geographically.
  • Are there some production companies with significantly more nonproducing wells than others?
  • Are there some production companies with significantly more producing wells than others?
• Student recruiting versus success in school: This is a protected set of data. Warning: No information is available without authorization. Trivial questions include the following:
  • Where should this university recruit?
  • Are there any U.S. states that this university should no longer accept students from?
  • Where in Wyoming and neighboring states are the best or worst recruiting areas?
  • Analyze the graduation rate and grade point average versus financial aid (two topics).
  • How did one Wyoming state resident acquire a distinctive O(\$10^5) in student loans (just attending college), assuming he or she had a Hathaway grant and did not pay the in-state \$4K/year tuition or fees covered?
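
For the Big Sentences problem, the nonunique-line filter can be sketched as follows. This is only a minimal sketch under two assumptions of mine: "filter out the nonunique lines" is read as "keep the lines that occur exactly once," and two lines that differ only by the optional extra blank after the last word count as the same line. The function names `normalize` and `unique_lines` are hypothetical.

```python
from collections import Counter


def normalize(line):
    """Drop the linefeed and the optional extra blank after the last word."""
    return line.rstrip("\n").rstrip(" ")


def unique_lines(lines):
    """Return the lines that occur exactly once, in their original order.

    Assumption: a line that appears two or more times is dropped entirely.
    """
    normalized = [normalize(line) for line in lines]
    counts = Counter(normalized)  # one hash-table pass over all k lines
    return [text for text in normalized if counts[text] == 1]


# Tiny made-up sample; the real input would be tiny.txt, small.txt, and so on.
sample = ["a b c\n", "a b c \n", "d e\n", "f g h\n", "d e\n"]
print(unique_lines(sample))
```

Because the counting uses a hash table, the running time should grow roughly linearly with the number of lines, though holding every line in memory may not be practical for the largest files.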

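The distance definition in the Big Sentences problem can be illustrated for the n = 1 case. The sketch below tests one pair of lines, assuming words are split on blanks and that the only allowed changes are the three listed (add 1 word, delete 1 word, swap 2 neighboring words). The name `within_one` is hypothetical, and this is one possible check, not necessarily the intended solution.

```python
def within_one(x, y):
    """Return True if line x can be turned into line y by at most one change."""
    a, b = x.split(), y.split()
    if a == b:
        return True  # distance 0
    if len(a) == len(b):
        # Same length: the only possible single change is a neighbor swap.
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        return (
            len(diff) == 2
            and diff[1] == diff[0] + 1
            and a[diff[0]] == b[diff[1]]
            and a[diff[1]] == b[diff[0]]
        )
    if abs(len(a) - len(b)) == 1:
        # Lengths differ by one: the longer line must equal the shorter
        # line with exactly one extra word inserted somewhere.
        short, longer = (a, b) if len(a) < len(b) else (b, a)
        i = 0
        while i < len(short) and short[i] == longer[i]:
            i += 1
        return short[i:] == longer[i + 1:]
    return False


print(within_one("a b c", "a c b"))  # neighbor swap
print(within_one("a b c", "a b"))    # delete one word
print(within_one("a b c", "a b d"))  # replacing a word takes two changes
```

Replacing a word is not in the list of changes, so it costs a delete plus an add. To find lines at exactly distance 1, require `within_one(x, y)` and that the word lists differ. Applying this test to every pair of lines is the obvious quadratic approach.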

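The wall clock comparison in the Big Sentences problem can be tried on synthetic data before touching the real files. The sketch below times an obvious all-pairs count of unique lines against a dictionary-based count. Everything in it (the function names, the `synthetic_lines` generator, the chosen sizes) is a hypothetical stand-in for the actual datasets, and the timings it prints will vary from machine to machine.

```python
import time


def count_unique_quadratic(lines):
    """Obvious O(k^2) approach: compare every line with every other line."""
    return sum(
        1
        for i, a in enumerate(lines)
        if not any(a == b for j, b in enumerate(lines) if i != j)
    )


def count_unique_hashed(lines):
    """Nearly O(k) approach: one dictionary of counts."""
    counts = {}
    for line in lines:
        counts[line] = counts.get(line, 0) + 1
    return sum(1 for c in counts.values() if c == 1)


def wall_clock(func, lines):
    """Return (result, elapsed seconds) for one call of func(lines)."""
    start = time.perf_counter()
    result = func(lines)
    return result, time.perf_counter() - start


def synthetic_lines(k):
    """Made-up data: roughly a third of the k lines repeat earlier ones."""
    distinct = 2 * k // 3 + 1
    return [f"w{i % distinct} x y" for i in range(k)]


for k in (100, 200, 400, 800):
    lines = synthetic_lines(k)
    slow_result, slow_time = wall_clock(count_unique_quadratic, lines)
    fast_result, fast_time = wall_clock(count_unique_hashed, lines)
    assert slow_result == fast_result
    print(f"k={k:5d}  quadratic={slow_time:.5f}s  hashed={fast_time:.5f}s")
```

Doubling k should roughly quadruple the quadratic time while roughly doubling the hashed time. The printed pairs can then be fed to any plotting tool to see the parabola.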