Big Data Analytics

Spring 2022

Homework #3

Due Date: March 2nd, 2022, 10PM

Consider the following set of transactions and

answer the following questions.

1. (30) Using the data in this table show the

execution of the two parallel algorithms,

Count Distribution and Data Distribution,

described in the paper discussed in class on

Feb 17th. Assume a minimum support level

of 3. Show the execution of each algorithm

only till the computation of frequent 2-

itemsets.

2. (10) Which of the above two algorithms, in

general, will require the most

communication among the servers? Justify

your answer and you can use instances

from answer to Q#1 above.

3. (10) Which of the above two algorithms, in

general, will require the smallest memory

at servers? Justify your answer and you can

use instances from answer to Q#1 above.

4. (15) What issues will need to be addressed if we want to use the third algorithm, the

Candidate distribution algorithm, on this data. Specifically, address the issues of which

data and which candidates will go on each processor. Do not show the execution of the

algorithm for this data.

5. (20) Use FP-Tree algorithm to generate a separate FP-tree at each of the servers. Now

merge the two FP-Trees to form a single FP-Tree.

6. (15) Propose your intuitive idea for an algorithm to, in general, merge two FP-trees to

form a merged FP-Tree, and give a pseudo-code for your proposed algorithm. For a

context, assume that each Map server uses a combiner to generate an FP-Tree for its

data chunk. The Reducer then takes the trees from all Mapper servers and generates

the combined global FP-Tree.

Trans ID | Items in Transaction |
Server in which the item resides |

T1 | A B C E M R | Server1 |

T2 | B C G M P | Server1 |

T3 | A C E T | Server1 |

T4 | B E G P | Server1 |

T5 | A B G P Q | Server2 |

T6 | B C G M | Server2 |

T7 | A B C G | Server2 |

T8 | A C G S T | Server2 |