Diksha Vanvari (@vanvaridiksha), Columbia University, New York
vanvaridiksha / Kafka-Mini-Homework.md
Last active October 21, 2016 05:06
Description of a mini assignment for the students of COMS 6998 - Cloud Computing and Big Data at Columbia University

# Twitter Streaming Using Kafka

  • Last week, you read the Kafka paper and summarized it. This week, you will be using Kafka and Zookeeper to stream Twitter data.
  • You can reuse code from your first homework for reading tweets using a twitter API library of your choice. The focus of this assignment will be on familiarizing you with Kafka and Zookeeper.
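Once the tweets are coming in, publishing them to Kafka is a few lines of producer code. The sketch below is a minimal illustration, not a prescribed solution: it assumes the `kafka-python` package, a broker on `localhost:9092`, and a topic named `tweets`, and it takes the tweets as plain dicts from whichever Twitter library you chose.

```python
import json

def tweet_to_message(tweet):
    """Serialize a tweet dict to the UTF-8 JSON bytes Kafka expects."""
    return json.dumps(tweet, sort_keys=True).encode("utf-8")

def stream_tweets_to_kafka(tweets, topic="tweets", servers="localhost:9092"):
    """Publish an iterable of tweet dicts to a Kafka topic.

    Requires a running broker and the kafka-python package
    (pip install kafka-python); both are assumptions of this sketch.
    """
    # Imported lazily so tweet_to_message stays usable without Kafka installed.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=servers)
    for tweet in tweets:
        producer.send(topic, tweet_to_message(tweet))
    producer.flush()  # block until all buffered messages are sent
```

With a broker running, `stream_tweets_to_kafka([{"user": "alice", "text": "hello kafka"}])` publishes one message; a consumer subscribed to the `tweets` topic can then read it back.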

## Installation and Setup

Download and install Zookeeper and Kafka on your machines. The exact steps depend on your platform, and there are a lot of tutorials readily available online. Follow any of them, and if you get stuck, your TAs can help you.
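On a Unix-like machine, the standard Kafka quickstart boils down to the commands below. The release version and topic name are examples only; substitute whatever you download and choose.

```shell
# Unpack a Kafka release (version shown is an example; use a current one)
tar -xzf kafka_2.11-0.10.0.0.tgz
cd kafka_2.11-0.10.0.0

# Start Zookeeper, then the Kafka broker, each in its own terminal
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# Create a topic to publish tweets to (topic name is an example)
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic tweets

# Verify the topic exists
bin/kafka-topics.sh --list --zookeeper localhost:2181
```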

## Problem Statement

vanvaridiksha / csds-spark.md
Last active July 3, 2017 15:51
Spark Assignment 3

# Spark Assignment 3

This assignment is based on Chapter 3 of the book 'Advanced Analytics with Spark'. You can find a copy of the book here.

In this assignment, you will build a recommendation system using Spark and MLlib with a dataset published by AudioScrobbler. The dataset is about 500 MB uncompressed and can be downloaded here.

Glance through the chapter to understand how to build the recommendation system; it walks you through the Scala code.
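If you would rather work in Python than follow the book's Scala, the core of the chapter can be sketched with `pyspark.mllib`. This is a rough sketch under assumptions, not a reference solution: the HDFS path and the ALS hyperparameters are illustrative, and it assumes the AudioScrobbler `user_artist_data.txt` format of space-separated `userID artistID playCount` lines.

```python
def parse_user_artist_line(line):
    """Parse one 'userID artistID playCount' line of user_artist_data.txt."""
    user, artist, count = line.strip().split()
    return int(user), int(artist), int(count)

def train_recommender(sc, path="hdfs:///user/ds/user_artist_data.txt"):
    """Train an implicit-feedback ALS model on the AudioScrobbler plays.

    Assumes a live SparkContext `sc`; path and hyperparameters are
    illustrative, not prescribed by the assignment.
    """
    # Imported lazily: needs a Spark installation on the machine.
    from pyspark.mllib.recommendation import ALS, Rating
    ratings = (sc.textFile(path)
                 .map(parse_user_artist_line)
                 .map(lambda t: Rating(t[0], t[1], float(t[2]))))
    # trainImplicit treats play counts as confidence weights, not as
    # explicit star ratings, which matches this listening-history data.
    return ALS.trainImplicit(ratings, rank=10, iterations=5,
                             lambda_=0.01, alpha=1.0)
```

Inside `pyspark` or `spark-submit`, `model = train_recommender(sc)` returns a model whose `recommendProducts(userID, n)` method yields the top-n artist recommendations for a user.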

vanvaridiksha / csds.md
Last active March 6, 2016 04:52
CSDS Assignment 2

# CSDS Hive Assignment 2

In the previous assignment you worked with the Hadoop, HDFS and Hive environments to perform simple MapReduce jobs and basic operations such as loading data into HDFS and querying a small Hive database. Now you will work with genuinely large data. In this assignment you will write your own MapReduce programs to perform more sophisticated tasks. You will also create your own Hive database from the raw dataset and run a few queries on it.
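As a reminder of the pattern your MapReduce programs will follow, here is the classic word count expressed as separate map and reduce steps, run locally in plain Python. The names and the local driver are illustrative; on the cluster, Hadoop (e.g. via Hadoop Streaming) performs the grouping between the two steps for you.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for each word on one input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts for each key.

    On a real cluster Hadoop groups pairs by key before reducing;
    the defaultdict stands in for that shuffle-and-group phase here.
    """
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def word_count(lines):
    """Local driver: run map and reduce over an iterable of lines."""
    return reducer(pair for line in lines for pair in mapper(line))
```

For example, `word_count(["to be or", "not to be"])` returns `{"to": 2, "be": 2, "or": 1, "not": 1}`; the same mapper and reducer, reading stdin and writing stdout, would be submitted as the map and reduce scripts of a streaming job.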

This assignment can be performed on the same Cloudera virtual machine used for the previous assignment; no further setup or installation is needed. You are free to use any programming language of your choice.

## About the Dataset

File Name: server-logz.gz (300 MB)