ECE 6102: Dependable Distributed Systems
Programming Assignment 3
Due date: April 10, 2018, 11:55 PM
Main Idea
In this assignment, you will use Google Cloud Dataflow pipelines to
analyze purchasing data
generated from a wine shopping app that is similar to the one you
developed in Assignment 2. The data is stored in a csv file with one
purchased wine type per line. The file can be downloaded from the link
provided in the "Resources" section of this assignment below.
A line in the file includes purchase date and time, user ID of
purchaser, information about the wine purchased, and number of bottles
purchased. The wine information includes country, region, variety,
winery, and price per bottle. On this data set, you will carry out a
variety of analyses, which are described below.
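As a starting point, one way to turn a line of the file into a record is sketched below. The column order and the field names in `Purchase` are assumptions made for illustration only; they are not taken from the data file or the template, so check the actual file before relying on them.

```python
import csv
from collections import namedtuple

# ASSUMPTION: hypothetical column order and field names; verify against the real file.
Purchase = namedtuple(
    "Purchase",
    "date time user_id wine_id country region variety winery price bottles",
)


def parse_purchase(line):
    """Parse one csv line into a Purchase, converting the numeric fields."""
    # csv.reader handles quoted fields that contain commas.
    fields = next(csv.reader([line]))
    record = Purchase(*[field.strip() for field in fields])
    return record._replace(price=float(record.price), bottles=int(record.bottles))
```

A line with the wrong number of fields raises a TypeError here, which may or may not be the behavior you want in a pipeline.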
Language Choice
Apache Beam, which is the SDK you will be using, supports both Java
and Python. While we have found the documentation sets for both
languages to be sufficient to complete the assignment, the
Java documentation is more extensive. However, we believe the Python
SDK will be easier to learn and we are also providing resources for
Python users including a program template. There is no such
template for Java and neither the TA nor the Professor will be able to
provide support to Java users should they encounter problems. For
these reasons, we strongly recommend the use of Python for this
assignment.
Part 1
Write Dataflow pipelines to implement the following
analyses and run them in Google Cloud:
- 1. Count and output the total number of bottles sold for each
wine that has been purchased at least once. Each wine type has a
unique ID, which can be matched to identify that two different
purchases are for the same wine.
- 2. Calculate and output the total dollar amount of sales for each
wine. Again, use the wine ID to match wines to generate this data.
- 3. Count and output the total number of bottles sold for each winery
that has had at least one bottle purchased. Two winery names are
the same if their strings match regardless of case, i.e. convert
everything to upper case or lower case first before comparing
names.
- 4. Calculate and output the total dollar amount of sales for each
winery. Match winery names as discussed in Item 3.
- 5.-8. Repeat Items 1-4 for one of the following six varieties:
- Chardonnay
- Malbec
- Pinot Noir
- Riesling
- Sauvignon Blanc
- Zinfandel
Use the variety whose first letter is the closest to the first letter
of your last name.
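The analyses above all share one shape: map each purchase to a (key, value) pair, then sum the values per key. The plain-Python sketch below illustrates that logic outside of any pipeline. The dictionary keys (`wine_id`, `winery`, `variety`, `price`, `bottles`) are hypothetical names, and computing dollars as price per bottle times bottles is our reading of "total dollar amount of sales". In the Beam Python SDK, the same shape is typically a `Map` step followed by `CombinePerKey(sum)`.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Sum the values for each key, like a group-by-key followed by a sum."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def bottles_per_wine(purchases):
    """Item 1: total bottles sold, keyed by wine ID."""
    return sum_per_key((p["wine_id"], p["bottles"]) for p in purchases)


def dollars_per_winery(purchases):
    """Item 4: total sales, keyed by the lower-cased winery name."""
    return sum_per_key(
        (p["winery"].lower(), p["price"] * p["bottles"]) for p in purchases
    )


def only_variety(purchases, variety):
    """Items 5-8: keep only purchases of one variety before aggregating."""
    return [p for p in purchases if p["variety"].lower() == variety.lower()]
```

Lower-casing the winery name means the output key is also lower case; the assignment allows either case as long as the comparison ignores it.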
Turn-in Requirements for Part 1
- The source code for all analyses should be included in a single
(Python or Java) file, based on the provided Python template file if
using Python.
- Each analysis result should be in a single csv file (i.e. 8 csv
files total should be turned in), named in the manner of the template's
arguments and formatted as listed below. Fields should be separated by
tabs (\t). There should be no headers in any of the
files (e.g. NO "Wine Index \t Bottles Sold" on the first line):
- bottles_sold.csv: < wine_index > \t < bottles_sold >
- dollars_sold.csv: < wine_index > \t < dollars_sold >
- winery_bottles_sold.csv: < winery > \t < bottles_sold >
- winery_dollars_sold.csv: < winery > \t < dollars_sold >
- < variety >_bottles_sold.csv: < wine_index > \t < bottles_sold >
- < variety >_dollars_sold.csv: < wine_index > \t < dollars_sold >
- < variety >_winery_bottles_sold.csv: < winery > \t < bottles_sold >
- < variety >_winery_dollars_sold.csv: < winery > \t < dollars_sold >
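The row format above can be sketched as follows. This is a minimal illustration, not part of the template: the function names are hypothetical, and sorting the rows by key is our own choice to make the output reproducible, since the assignment does not specify a row order.

```python
def format_row(key, value):
    """Format one result as '<key>\\t<value>' with no header and no extra fields."""
    return "{}\t{}".format(key, value)


def format_rows(totals):
    """Format a {key: total} mapping as tab-separated lines, sorted by key."""
    return [format_row(key, value) for key, value in sorted(totals.items())]
```

For example, `format_rows({"2": 3, "1": 5})` gives the two lines `1\t5` and `2\t3`. Note that Beam's `WriteToText` writes sharded output files by default; we believe passing `num_shards=1` and `shard_name_template=''` is one way to get a single file with an exact name, but verify this against the SDK version you use.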
Part 2
For each wine that was purchased at least once, find the other
wine that was purchased most often at the same time and count how many
times the two wines were purchased together. Note that one
transaction involving multiple wines is entered as multiple
wine purchases in the data set. Wines purchased at the same time have
matching dates, times, and user IDs.
Notes:
- Each wine purchased at least once should appear exactly once
as < wine_index_1 > in the csv file.
- As in Part 1, there should be no headers in the csv file.
- If a wine was never purchased with another wine, the output
should show: < wine_index_1 > \t "None" \t < times_purchased >.
That is, replace the second wine index with the string "None".
times_purchased should be 0 in these cases, since the wine was
purchased with another wine 0 times.
- If a wine was purchased together with several other wines the same
number of times, and that number is the largest for the wine, the
output should show:
< wine_index_1 > \t < wine_index_2 > \t < wine_index_3 > \t ... \t < times_purchased >.
The wines following < wine_index_1 > must be listed in ascending
numerical order. That is, append every wine that shares the maximum
"purchased-together" count, in ascending numerical order.
Example: If A is purchased together 10 times each with B and C
and 10 is the largest "purchased-together" value, then the
output for A would look like:
"A \t B \t C \t 10".
Turn-in Requirements for Part 2
- The source code for this analysis should be included in the same
(Python or Java) file containing the source code from Part 1.
- The analysis result should be in a single csv file named
most_purchased_together.csv with format as described above.
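The Part 2 matching rules can be sketched in plain Python as follows. This is only an illustration of the logic under stated assumptions, not a Dataflow pipeline: wine IDs are assumed to be integers, a wine listed twice in one transaction is counted once, and "None" is written without quote characters. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def most_purchased_together(purchases):
    """purchases: iterable of (date, time, user_id, wine_id) tuples.

    Returns {wine_id: (partners, count)}, where partners is the ascending
    list of wines tied for the largest count, or ["None"] with count 0.
    """
    # Wines with matching date, time, and user ID form one transaction.
    baskets = defaultdict(set)
    for date, time, user_id, wine_id in purchases:
        baskets[(date, time, user_id)].add(wine_id)

    pair_counts = {}
    for wines in baskets.values():
        for wine in wines:
            # Register every purchased wine, even if it has no partners.
            pair_counts.setdefault(wine, Counter())
        for a, b in combinations(sorted(wines), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1

    result = {}
    for wine, partners in pair_counts.items():
        if not partners:
            result[wine] = (["None"], 0)
        else:
            best = max(partners.values())
            tied = sorted(w for w, c in partners.items() if c == best)
            result[wine] = (tied, best)
    return result


def format_line(wine, partners, count):
    """Join the wine, its partners, and the count with tabs."""
    return "\t".join([str(wine)] + [str(p) for p in partners] + [str(count)])
```

Because every purchased wine is registered before pairs are counted, each wine appears exactly once as the first field, including wines that were only ever bought alone.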
Resources
Important notes - for both parts
- Grading will occur by running your code on a subset of the dataset in
Google Cloud Dataflow. There are many situations where your code might
work in the DirectRunner but not in the GoogleCloudDataflow runner.
It is your responsibility to ensure that your code works correctly for
GoogleCloudDataflow. The DirectRunner will not be used in grading or
in assessing points.
- Many of the problems you are likely to encounter in this assignment
will be about the SDK. Please post these types of questions to Piazza
so that your fellow students can see the answers. More specific
implementation questions are welcome via email or during office hours.
Turn-in:
On T-square, turn in the following items:
- All parts:
- If using Python: one Python file using the template from the resources above.
- If using Java: one Java file.
- A README.txt containing any information needed to
run your code on Google Cloud Dataflow including the specific
commands used to run your code for all 9 parts.
Example:
- java test.java --input small_dataset.csv --output bottles.csv --runner Direct --bottles_sold
- java test.java --input small_dataset.csv --output dollars.csv --runner Direct --dollars_sold
- java test.java --input small_dataset.csv --output wine_dollars.csv --runner Direct --winery_dollars_sold
- java test.java --input small_dataset.csv --output wine_bottles.csv --runner Direct --winery_bottles_sold
- Part 1: The following csv files, named as shown below and
formatted as described in the assignment description.
- bottles_sold.csv
- dollars_sold.csv
- winery_bottles_sold.csv
- winery_dollars_sold.csv
- < variety >_bottles_sold.csv
- < variety >_dollars_sold.csv
- < variety >_winery_bottles_sold.csv
- < variety >_winery_dollars_sold.csv
- Part 2: The csv file named most_purchased_together.csv
with formatting as described in the assignment description.