ECE 6102: Dependable Distributed Systems

Programming Assignment 3

Due date: April 10, 2018, 11:55 PM



Main Idea

In this assignment, you will use Google Cloud Dataflow pipelines to analyze purchasing data generated by a wine shopping app similar to the one you developed in Assignment 2. The data is stored in a CSV file with one purchased wine type per line. The file can be downloaded from the link provided in the "Resources" section below.

A line in the file includes purchase date and time, user ID of purchaser, information about the wine purchased, and number of bottles purchased. The wine information includes country, region, variety, winery, and price per bottle. On this data set, you will carry out a variety of analyses, which are described below.
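As a rough illustration of the record layout described above, the sketch below parses one line into named fields. The column order in FIELDS, the separate date and time columns, and the position of the wine ID are all assumptions made for this example; check them against the actual file before relying on them. The helper name parse_purchase is hypothetical.

```python
import csv
from collections import namedtuple

# ASSUMPTION: this column order is illustrative only. Verify it against
# the real data file and adjust FIELDS accordingly.
FIELDS = ["date", "time", "user_id", "wine_id", "country", "region",
          "variety", "winery", "price", "bottles"]
Purchase = namedtuple("Purchase", FIELDS)


def parse_purchase(line):
    """Parse one CSV line into a Purchase record (hypothetical helper)."""
    # csv.reader handles quoted fields that contain commas.
    values = next(csv.reader([line]))
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, (v.strip() for v in values)))
    record["price"] = float(record["price"])    # price per bottle
    record["bottles"] = int(record["bottles"])  # number of bottles purchased
    return Purchase(**record)
```

A function of this shape can be applied to every input line in a pipeline's first mapping step, so that later steps work with named fields rather than column positions.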

Language Choice

Apache Beam, the SDK you will be using, supports both Java and Python. We have found the documentation for both languages sufficient to complete the assignment, although the Java documentation is more extensive. However, we believe the Python SDK is easier to learn, and we are providing resources for Python users, including a program template. There is no such template for Java, and neither the TA nor the professor will be able to support Java users who encounter problems. For these reasons, we strongly recommend using Python for this assignment.

Part 1

Write Dataflow pipelines to implement the following analyses and run them in Google Cloud:
1. Count and output the total number of bottles sold for each wine that has been purchased at least once. Each wine type has a unique ID, which can be matched to identify that two different purchases are for the same wine.

2. Calculate and output the total dollar amount of sales for each wine. Again, use the wine ID to match wines to generate this data.

3. Count and output the total number of bottles sold for each winery that has had at least one bottle purchased. Two winery names are the same if their strings match regardless of case, i.e., convert all names to a single case (upper or lower) before comparing them.

4. Calculate and output the total dollar amount of sales for each winery. Match winery names as discussed in Item 3.

5.-8. Repeat Items 1-4 for one of the following six varieties:

  • Chardonnay
  • Malbec
  • Pinot Noir
  • Riesling
  • Sauvignon Blanc
  • Zinfandel

Use the variety whose first letter is the closest to the first letter of your last name.
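The logic behind Items 1-8 can be sketched in plain Python before you move it into a pipeline. This is not a Beam pipeline and not a reference solution: it only shows the keying and summing that each item calls for, plus one possible reading of the variety rule. The record keys ("wine_id", "winery", "variety", "price", "bottles") and the function names are assumptions made for this example. The assignment does not say how to break a tie between two equally close varieties, so closest_varieties returns every tied candidate rather than picking one.

```python
from collections import defaultdict

VARIETIES = ["Chardonnay", "Malbec", "Pinot Noir",
             "Riesling", "Sauvignon Blanc", "Zinfandel"]


def closest_varieties(last_name):
    """Return the varieties whose first letter is closest to the name's.

    ASSUMPTION: the name starts with an ASCII letter. Ties are not
    resolved here; all equally close varieties are returned.
    """
    initial = last_name.strip()[0].upper()
    distance = {v: abs(ord(v[0]) - ord(initial)) for v in VARIETIES}
    best = min(distance.values())
    return [v for v in VARIETIES if distance[v] == best]


def totals(purchases, key_fn, variety=None):
    """Sum bottles and dollar sales per key, optionally for one variety."""
    bottles = defaultdict(int)
    sales = defaultdict(float)
    for p in purchases:
        if variety is not None and p["variety"] != variety:
            continue  # Items 5-8: keep only the chosen variety
        key = key_fn(p)
        bottles[key] += p["bottles"]
        sales[key] += p["price"] * p["bottles"]
    return dict(bottles), dict(sales)


def by_wine(p):
    # Items 1 and 2: wines are matched by their unique ID.
    return p["wine_id"]


def by_winery(p):
    # Items 3 and 4: winery names are compared regardless of case.
    return p["winery"].lower()
```

In a Beam pipeline, the same idea is usually expressed by mapping each record to a (key, value) pair and summing the values per key, for example with beam.CombinePerKey(sum); the loop above stands in for that step.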

Turn-in Requirements for Part 1

Part 2

For each wine that was purchased at least once, find the other wine that was purchased most often at the same time and count how many times the two wines were purchased together. Note that one transaction involving multiple wines is entered as multiple wine purchases in the data set. Wines purchased at the same time have matching dates, times, and user IDs.
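The co-purchase analysis can be thought through in plain Python first. The sketch below is not a Beam pipeline and not a reference solution; it groups purchases into transactions by (date, time, user ID), counts each unordered pair of wines once per transaction, and keeps the most frequent partner for each wine. The record keys and the function name top_copurchases are assumptions made for this example. Two behaviors are also assumptions, since the assignment text does not specify them: ties between partners are resolved by whichever pair is seen first, and a wine that was never bought alongside another wine does not appear in the result.

```python
from collections import Counter, defaultdict
from itertools import combinations


def top_copurchases(purchases):
    """Map each wine ID to (most frequent partner wine ID, count)."""
    # Step 1: collect the set of wines in each transaction. A transaction
    # is identified by matching date, time, and user ID.
    baskets = defaultdict(set)
    for p in purchases:
        baskets[(p["date"], p["time"], p["user_id"])].add(p["wine_id"])

    # Step 2: count every unordered pair once per transaction.
    pair_counts = Counter()
    for wines in baskets.values():
        for a, b in combinations(sorted(wines), 2):
            pair_counts[(a, b)] += 1

    # Step 3: for each wine, keep the partner with the highest count.
    # ASSUMPTION: on a tie, the first pair encountered is kept.
    best = {}
    for (a, b), count in pair_counts.items():
        for wine, other in ((a, b), (b, a)):
            if wine not in best or count > best[wine][1]:
                best[wine] = (other, count)
    return best
```

The three steps correspond to a group-by-key on the transaction identifier, a per-key count over wine pairs, and a per-wine maximum, which is one way the computation could be decomposed into pipeline stages.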

Notes:

Turn-in Requirements for Part 2

Resources

Important notes - for both parts

Turn-in:

On T-square, turn in the following items: