Distributed Data Processing Environments

Bachelor in Data Science

2025/2026

The course aims to provide hands-on experience with state-of-the-art distributed data processing environments. First, it addresses cloud computing concepts and tools. Then, it provides a bottom-up overview of distributed storage and processing technologies, emphasizing scalability and usability. Finally, it introduces the basic systems and information security concepts required to use current distributed environments safely.

Learning outcomes

Instructors

Grading

The grade has two components:

Submitted projects must be fully authored by the students and must not contain materials (text, code, …) taken from third parties, obtained online, or produced with AI tools, unless explicitly marked and authorized by the instructors. See the Academic Regulations and Code of Ethical Conduct for more information.

Schedule

# Date Topic Mat. Read
T1 18/9/25 Introduction. 🗎  
PL1 19/9/25 Linux installation and basics. 🗎  
T2 25/9/25 Virtualization and cloud. 🗎 R1 1-5; R2 9,24
PL2 26/9/25 Provisioning. 🗎  
PL3 3/10/25 Cloud access. 🗎  
T3 9/10/25 Storage management. 🗎 R2 20
PL4 10/10/25 Instance management. 🗎 (V)  
T4 16/10/25 File formats. 🗎 R3
PL5 17/10/25 Instance management (cont).    
PL6 23/10/25 Storage and files. 🗎 (V)  
PL7 24/10/25 Storage and files (cont).    
T5 30/10/25 Query execution. 🗎 R4 4
PL8 31/10/25 Query execution. 🗎  
PL9 6/11/25 Query execution (cont).    
T6 7/11/25 Query optimization. 🗎 R4 4
T7 13/11/25 Distributed execution. 🗎 R4 3; R5
PL10   Distributed execution. 🗎  
T8 20/11/25 Security. 🗎 R6 1-3
PL10 21/11/25 Orchestration. 🗎 (V)  
PL11 27/11/25 Project.    
PL12 28/11/25 Project.    
T9 4/12/25 Cryptography.   R6 14
PL12 5/12/25 Project.    

Bibliography

# Title
R1 Fox, A., et al. Above the clouds: A Berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS 28.13 (2009).
R2 Nemeth, E., Snyder, G., Hein, T.R., Whaley, B., Mackin, D. UNIX and Linux System Administration Handbook (5th Edition), Addison-Wesley Professional, 2017.
R3 Aditya Somani. A Data Engineer’s Guide to Columnar Storage.
R4 J. M. Hellerstein, M. Stonebraker, and J. Hamilton. Architecture of a Database System. Foundations and Trends® in Databases, vol. 1, no. 2, pp. 141–259, 2007.
R5 Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, Comm. ACM, 2008.
R6 Dieter Gollmann. Computer Security. Wiley, 2011.
A1 J. Pereira. Introdução ao Unix. U. Minho, 2025.
A2 Alex Braunton. Hands-On DevOps with Vagrant, Packt Publishing, 2018.
A3 Mark Needham, Michael Hunger and Michael Simons. DuckDB in Action. Manning, 2024.