Week 1 (4/1) [Slides]
Course Overview
Introduction to big-data management: frameworks, data management, analytics, machine learning, etc. The classes focus on frameworks and data management.
1. What is big data?
2. Frameworks and cloud computing
3. OLTP vs. OLAP vs. big data
4. Big data frameworks
5. Data cleaning
Weeks 2-3 (4/8, 4/15) [Slides]
Relational Databases: SQL refresher, relational model, XML, JSON, semi-structured data, RDBMS, AWS RDS.
1. Database System Concepts by Avi Silberschatz, Henry F. Korth, and S. Sudarshan
2. Database Management Systems by Raghu Ramakrishnan and Johannes Gehrke
3. Database Systems: The Complete Book by Héctor García-Molina, Jeffrey Ullman, and Jennifer Widom
Week 4 (4/22) [Slides]
Limitations of RDBMSs and motivation for NoSQL; Intro to Cassandra
Non-relational systems: key-value stores, distributed storage, and NoSQL column stores (C-Store, HBase, Cassandra)
1. Good summary of all NoSQL material: http://www.christof-strauch.de/nosqldbs.pdf
2. http://cassandra.apache.org/
3. Stonebraker, Mike, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau et al. "C-Store: A column-oriented DBMS." In Proceedings of the 31st International Conference on Very Large Data Bases, pp. 553-564. VLDB Endowment, 2005.
4. DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. "Dynamo: Amazon's highly available key-value store." In ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 205-220. ACM, 2007.
5. George, Lars. HBase: The definitive guide: Random access to your planet-size data. O'Reilly Media, Inc., 2011.
Week 5 (4/29) [cont'd]
NoSQL (cont'd), MongoDB
1. https://www.mongodb.com/
Review [review slides]
Review session (TA)
Week 6 (5/6)
MongoDB application and examples (see material on Canvas)
Choosing the right database, with a focus on AWS
Midterm 1 (80 minutes)
Week 7 (5/13) [Slides]
Intro to big-data frameworks: distributed file systems (with a focus on HDFS), MapReduce, Hadoop, and Spark.
1. Chapter 2 of Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
2. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51, no. 1 (2008): 107-113.
3. Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler. "The Hadoop distributed file system." In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1-10. IEEE, 2010.
4. White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
Week 8 (5/20) [MapReduce Slides, Spark Slides]
MapReduce and Spark Programming
1. White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
2. Learning Spark: Lightning-Fast Big Data Analysis by Andy Konwinski, Holden Karau, Matei Zaharia, and Patrick Wendell
3. Zaharia, Matei, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2-2. USENIX Association, 2012.
Week 9 (5/27) [cont'd]
Spark programming (cont'd)
Review session (TA) [Slides]
Week 10 (6/3)
Spark programming (cont'd), Column Stores and ETL
Midterm 2 (80 minutes)