Week 1 (3/31) [Slides]
Course Overview
Introduction to big-data management: frameworks, data management, analytics, machine learning, etc. The course focuses on frameworks and data management.
1. What is big data
2. Frameworks and cloud computing
3. OLTP vs. OLAP vs. big data
4. Big data frameworks
5. Data cleaning
Weeks 2-3 (4/7, 4/14) [Slides]
Relational Databases: SQL refresher, relational model, XML, JSON, semi-structured data, RDBMS, AWS RDS.
1. Database System Concepts by Avi Silberschatz, Henry F. Korth, and S. Sudarshan
2. Database Management Systems by Raghu Ramakrishnan and Johannes Gehrke
3. Database Systems: The Complete Book by Héctor García-Molina, Jeffrey Ullman, and Jennifer Widom
Week 4 (4/21) [Slides]
Limitations of RDBMSs and motivation for NoSQL; Intro to MongoDB
Non-RDBMS: key-value stores, distributed storage, and NoSQL column stores (C-Store, HBase, Cassandra)
1. Good summary of all NoSQL material: http://www.christof-strauch.de/nosqldbs.pdf
2. Stonebraker, Mike, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau et al. "C-Store: a column-oriented DBMS." In Proceedings of the 31st international conference on Very large data bases, pp. 553-564. VLDB Endowment, 2005.
3. DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. "Dynamo: Amazon's highly available key-value store." In ACM SIGOPS operating systems review, vol. 41, no. 6, pp. 205-220. ACM, 2007.
4. MongoDB manual: https://dlib.hust.edu.vn/bitstream/HUST/23974/1/OER000003074.pdf
Week 5 (4/28) [cont'd]
MongoDB application and examples (see material on Canvas) - https://www.mongodb.com/
Choosing the right database, with focus on AWS
Review [review slides]
Review session (TA)
Week 6 (5/5) [Vector Store slides]
Midterm 1 (80 minutes)
Data Management for AI and Vector Stores
Pan, J. J., J. Wang, and G. Li. "Survey of vector database management systems." arXiv preprint arXiv:2310.14021, 2023.
https://learn.microsoft.com/en-us/data-engineering/playbook/solutions/vector-database/
https://www.mongodb.com/resources/basics/databases/vector-databases
Week 7 (5/12) [Data Management for AI and Vector Stores (cont'd)]
Week 8 (5/19) [Big Data Frameworks Slides]
Intro to Big data frameworks and Spark; Lakehouse architecture
1. Chapter 2 of Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
2. Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
3. Zaharia, Matei, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2-2. USENIX Association, 2012.
Week 9 (5/26) [Spark (cont'd)]
Spark (cont'd)
Review session (TA) [Slides]
Week 10 (6/2) [Lakehouse and Column Stores]
Lakehouse and Column Stores (cont'd)
What Is a Lakehouse (Databricks): https://www.databricks.com/blog/what-is-data-lakehouse
What Is a Lakehouse (AWS): https://aws.amazon.com/what-is/data-lakehouse/
Midterm 2 (80 minutes)