ECE 6102: Dependable Distributed Systems

Programming Assignment 3

Due date: April 10, 2018, 11:55 PM

Main Idea

In this assignment, you will use Google Cloud Dataflow pipelines to analyze purchasing data generated from a wine shopping app that is similar to the one you developed in Assignment 2. The data is stored in a csv file with one purchased wine type per line. The file can be downloaded from the link provided in the "Resources" section of this assignment below.

A line in the file includes purchase date and time, user ID of purchaser, information about the wine purchased, and number of bottles purchased. The wine information includes country, region, variety, winery, and price per bottle. On this data set, you will carry out a variety of analyses, which are described below.
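As an illustration of the record layout described above, here is a minimal Python sketch of parsing one line. The column order, the field names, the presence of the wine ID as the third column, and the sample values are all assumptions made for this example; check the actual file before relying on any of them.

```python
import csv
from collections import namedtuple

# Hypothetical field order; the real file may arrange its columns differently.
Purchase = namedtuple(
    "Purchase",
    "timestamp user_id wine_id country region variety winery price bottles",
)

def parse_purchase(line):
    """Parse one csv line into a Purchase, converting the numeric fields."""
    # csv.reader handles quoted fields that themselves contain commas.
    fields = next(csv.reader([line]))
    ts, user, wine, country, region, variety, winery, price, bottles = fields
    return Purchase(ts, user, wine, country, region, variety, winery,
                    float(price), int(bottles))

# Made-up sample line, not taken from the data set.
example = parse_purchase(
    '2018-03-01 19:42:10,u17,4021,US,"Napa Valley, CA",Zinfandel,'
    'Example Cellars,24.50,3'
)
```

Splitting on "," by hand would break on a quoted field such as the region in the sample line, which is why the sketch goes through the csv module.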

Language Choice

Apache Beam, which is the SDK you will be using, supports both Java and Python. While we have found the documentation sets for both languages to be sufficient to complete the assignment, the Java documentation is more extensive. However, we believe the Python SDK will be easier to learn and we are also providing resources for Python users including a program template. There is no such template for Java and neither the TA nor the Professor will be able to provide support to Java users should they encounter problems. For these reasons, we strongly recommend the use of Python for this assignment.

Part 1

Write Dataflow pipelines to implement the following analyses and run them in Google Cloud:

1. Count and output the total number of bottles sold for each wine that has been purchased at least once. Each wine type has a unique ID, which can be matched to identify that two different purchases are for the same wine.

2. Calculate and output the total dollar amount of sales for each wine. Again, use the wine ID to match wines to generate this data.

3. Count and output the total number of bottles sold for each winery that has had at least one bottle purchased. Two winery names are the same if their strings match regardless of case, i.e. convert everything to upper case or lower case first before comparing names.

4. Calculate and output the total dollar amount of sales for each winery. Match winery names as discussed in Item 3.

5.-8. Repeat Items 1-4 for one of the following six varieties:

Chardonnay
Malbec
Pinot Noir
Riesling
Sauvignon Blanc
Zinfandel
Use the variety whose first letter is the closest to the first letter of your last name.
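
The eight analyses above share one shape: optionally filter by variety, choose a key (wine ID or case-normalized winery name), and sum a value per key. The plain-Python sketch below shows that logic only; it is not a Dataflow pipeline. The dictionary field names are hypothetical, and it assumes that the dollar amount of a purchase is the price per bottle times the number of bottles. In a Beam pipeline the same steps would typically be expressed with transforms such as beam.Filter, beam.Map, and beam.CombinePerKey(sum).

```python
from collections import defaultdict

def winery_key(name):
    """Normalize a winery name so that names differing only in case match."""
    return name.lower()

def totals_by_key(purchases, key_fn, variety=None):
    """Return {key: (bottles_sold, dollars_sold)}, optionally for one variety.

    Each purchase is a dict with the hypothetical fields wine_id, winery,
    variety, price (per bottle), and bottles.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for p in purchases:
        if variety is not None and p["variety"] != variety:
            continue  # skip purchases of other varieties
        entry = totals[key_fn(p)]
        entry[0] += p["bottles"]
        entry[1] += p["price"] * p["bottles"]
    return {k: (b, d) for k, (b, d) in totals.items()}

# Made-up purchases for illustration.
purchases = [
    {"wine_id": "1", "winery": "Alpha Cellars", "variety": "Malbec",
     "price": 10.0, "bottles": 2},
    {"wine_id": "1", "winery": "ALPHA CELLARS", "variety": "Malbec",
     "price": 10.0, "bottles": 1},
    {"wine_id": "2", "winery": "Beta Estate", "variety": "Riesling",
     "price": 20.0, "bottles": 3},
]

by_wine = totals_by_key(purchases, lambda p: p["wine_id"])
by_winery = totals_by_key(purchases, lambda p: winery_key(p["winery"]))
malbec_by_wine = totals_by_key(purchases, lambda p: p["wine_id"], "Malbec")
```

Note that "Alpha Cellars" and "ALPHA CELLARS" collapse into one winery key, as Item 3 requires.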

Turn-in Requirements for Part 1

The source code for all analyses should be included in a single (Python or Java) file, based on the provided Python template file if using Python.
Each analysis result should be in a single csv file (i.e., 8 csv files in total should be turned in), labeled in the manner of the template's arguments, with the formats listed below. Despite the csv extension, fields should be tab separated (\t). There should be no headers in any of the files (e.g., the first line must NOT be Wine Index \t Bottles Sold):
- bottles_sold.csv: < wine_index > \t < bottles_sold >
- dollars_sold.csv: < wine_index > \t < dollars_sold >
- winery_bottles_sold.csv: < winery > \t < bottles_sold >
- winery_dollars_sold.csv: < winery > \t < dollars_sold >
- < variety >_bottles_sold.csv: < wine_index > \t < bottles_sold >
- < variety >_dollars_sold.csv: < wine_index > \t < dollars_sold >
- < variety >_winery_bottles_sold.csv: < winery > \t < bottles_sold >
- < variety >_winery_dollars_sold.csv: < winery > \t < dollars_sold >
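
A minimal sketch of the line format above, assuming results are available as key/value pairs. The function name and sample values are hypothetical, and the assignment does not specify a number format for dollar amounts, so the sketch simply uses Python's default string conversion.

```python
def format_row(key, value):
    """Render one result as a tab-separated line, without a trailing newline."""
    return "{}\t{}".format(key, value)

# Made-up results: wine index -> bottles sold.
bottles_sold = {"4021": 3, "77": 12}

# No header line is emitted; every line is a data row.
lines = [format_row(k, v) for k, v in sorted(bottles_sold.items())]
```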

Part 2

For each wine that was purchased at least once, find the other wine that was purchased most often at the same time and count how many times the two wines were purchased together. Note that one transaction involving multiple wines is entered as multiple wine purchases in the data set. Wines purchased at the same time have matching dates, times, and user IDs.

Notes:

Each wine purchased at least once should appear exactly once as < wine_index_1 > in the csv file.
As in Part 1, there should be no headers in the csv file.
If a wine was never purchased with another wine, the output should show: < wine_index_1 > \t "None" \t < times_purchased >. Essentially, replace the second wine index with the string "None". times_purchased should be 0 in these cases since the number of times it was purchased with another wine is 0.
If a wine was purchased together the same number of times with multiple wines and that number is the largest, the output should show: < wine_index_1 > \t < wine_index_2 > \t < wine_index_3 > \t ... \t < times_purchased >. The wines following < wine_index_1 > must be listed in ascending numerical order. Essentially, list every wine that shares the maximum purchased-together count, in ascending numerical order.
Example: If A is purchased together 10 times each with B and C and 10 is the largest "purchased-together" value, then the output for A would look like: "A \t B \t C \t 10".
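
The rules above can be sketched in plain Python as follows. This illustrates the logic only and is not a Dataflow pipeline. It assumes that each purchase has already been reduced to a (timestamp, user ID, wine ID) tuple with an integer wine ID, that a wine listed twice in one transaction counts once, and that the string None is written without quotation marks; all names and sample values are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def most_purchased_together(purchases):
    """Map each purchased wine to (sorted list of top partners, count)."""
    # Group wine IDs into transactions keyed by (timestamp, user ID).
    transactions = defaultdict(set)
    for timestamp, user_id, wine_id in purchases:
        transactions[(timestamp, user_id)].add(wine_id)

    # Count how often each ordered pair of distinct wines shares a transaction.
    pair_counts = {}
    for wines in transactions.values():
        for wine in wines:
            # Wines bought alone still need an (empty) entry.
            pair_counts.setdefault(wine, defaultdict(int))
        for a, b in permutations(wines, 2):
            pair_counts[a][b] += 1

    result = {}
    for wine, partners in pair_counts.items():
        if partners:
            best = max(partners.values())
            tied = sorted(w for w, c in partners.items() if c == best)
            result[wine] = (tied, best)
        else:
            result[wine] = ([], 0)
    return result

def format_line(wine, partners, count):
    """Join the fields with tabs, writing None when there is no partner."""
    middle = [str(w) for w in partners] or ["None"]
    return "\t".join([str(wine)] + middle + [str(count)])

# Made-up data: wine 1 is bought twice with wine 2 and twice with wine 3;
# wine 4 is only ever bought alone.
sample = [
    ("t1", "u1", 1), ("t1", "u1", 2),
    ("t2", "u1", 1), ("t2", "u1", 2),
    ("t3", "u2", 1), ("t3", "u2", 3),
    ("t4", "u2", 1), ("t4", "u2", 3),
    ("t5", "u3", 4),
]
together = most_purchased_together(sample)
output = [format_line(w, *together[w]) for w in sorted(together)]
```

Here wine 1 ties between wines 2 and 3, so its line lists both partners before the count, and wine 4 gets None with a count of 0.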

Turn-in Requirements for Part 2

The source code for this analysis should be included in the same (Python or Java) file containing the source code from Part 1.
The analysis result should be in a single csv file named most_purchased_together.csv with format as described above.

Resources

Python template (Adhere to the comments in the template. Useful hints can be obtained by running "python template.py --help". Useful libraries can still be imported within their function definitions such as for a ParDo function or Map function. Use of filesystem libraries such as [os, shutil, sys] is strictly prohibited.)
Basic info to get started with Cloud Dataflow using Python
Source code for wordcount examples used on basic info site
More detailed (but slightly older) tutorial of wordcount example (explains the pipeline properties needed to build dataflow applications)
Info on flags for Cloud Dataflow runners (pay particular attention to the numWorkers flag)
Info on Cloud Dataflow internal operation
Documentation on built-in Apache Beam transforms
How to copy files to and from Google Cloud Storage
Info on output file sharding
Source file for WriteToText and ReadFromText functions of Apache Beam
Page to download purchases file (most likely, you will want to debug and test your code with a subset of this file, e.g. the first 1000-5000 lines)

Important notes - for both parts

Grading will occur by running your code on a subset of the dataset in Google Cloud Dataflow. There are many situations where your code might work in the DirectRunner but not in the GoogleCloudDataflow runner. It is your responsibility to ensure that your code works correctly for GoogleCloudDataflow. The DirectRunner will not be used in grading and therefore will not be used in points assessment.
Many of the problems you are likely to encounter in this assignment will be about the SDK. Please post these types of questions to Piazza so that your fellow students can see the answers. More specific implementation questions are welcome via email or during office hours.
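
One known cause of code that works in the DirectRunner but fails on Cloud Dataflow is a NameError on the remote workers for a module that was imported only at the top of the main file. A hedged sketch of the workaround mentioned in the template notes, importing inside the function that a Map or ParDo step would call, is shown below; the function name is hypothetical.

```python
def split_fields(line):
    """Split one csv line into its fields."""
    # Importing here, rather than at module level, means the name is
    # resolved wherever this function actually executes.
    import csv
    return next(csv.reader([line]))

fields = split_fields('a,"b,c",d')
```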

Turn-in:

On T-square, turn in the following items:

All parts:
- If using Python: one Python file using the template from the resources above.
  If using Java: one Java file.
- A README.txt containing any information needed to run your code on Google Cloud Dataflow, including the specific commands used to run your code for all 9 analyses (8 in Part 1 and 1 in Part 2).
  Example:
  - java test.java --input small_dataset.csv --output bottles.csv --runner Direct --bottles_sold
  - java test.java --input small_dataset.csv --output dollars.csv --runner Direct --dollars_sold
  - java test.java --input small_dataset.csv --output wine_dollars.csv --runner Direct --winery_dollars_sold
  - java test.java --input small_dataset.csv --output wine_bottles.csv --runner Direct --winery_bottles_sold
Part 1: The following csv files, named as shown and formatted as described in the assignment description.
- bottles_sold.csv
- dollars_sold.csv
- winery_bottles_sold.csv
- winery_dollars_sold.csv
- < variety >_bottles_sold.csv
- < variety >_dollars_sold.csv
- < variety >_winery_bottles_sold.csv
- < variety >_winery_dollars_sold.csv
Part 2: The csv file named most_purchased_together.csv with formatting as described in the assignment description.
