Advanced Infrastructures for Data Science
2
2025-2026
02038700
Informatics
Portuguese
English
Face-to-face
Semestrial
6.0
Compulsory
2nd Cycle Studies - Mestrado
Recommended Prerequisites
Good programming skills and knowledge of distributed systems and cloud computing are recommended. Fluency in English at level B2 (ideally C1), according to the Common European Framework of Reference for Languages.
Teaching Methods
Lecture classes (T): presentation and discussion around the topics of the course.
Lab classes (PL): application of theoretical concepts in projects.
Learning Outcomes
The curricular unit aims to provide students with skills in the computational infrastructures required for data science, chiefly understanding and managing computational clusters that support the storage, processing and distribution of large amounts of data. It also explores the creation, deployment and orchestration of containers (microservices) in data science applications, using frameworks such as Docker and Kubernetes. In addition, the course introduces Big Data frameworks: students should become familiar with the main Big Data architectures and gain hands-on contact with the main reference frameworks, Apache Hadoop and Apache Spark. For both frameworks, distributed storage and processing of large amounts of data will be studied and experimented with, in both batch and real-time scenarios.
Work Placement(s)
No
Syllabus
A. Data Science Infrastructures – Intro
1. Support infrastructures for Data Science: an introduction
2. Managing data center infrastructures: computing, storage and communications
B. Containers Deployment & Orchestration
3. Containerized applications (e.g., Docker)
4. Container orchestration systems – Imperative and Declarative Approaches (e.g., Kubernetes, Mesos, Docker Swarm)
C. Big Data
5. Big data scalable and distributed transport (e.g., Apache Kafka)
6. Big data architectures: Kappa and Lambda
7. Big data distributed storage frameworks (e.g., Apache Hadoop HDFS)
8. Big data cluster resources orchestration frameworks (e.g., Apache Hadoop YARN, Spark Cluster)
9. Big data (batch and real-time) distributed processing frameworks (e.g., Apache Hadoop MapReduce, Spark)
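As an illustration of the batch processing model named in topic 9, the following is a minimal single-machine sketch of the map, shuffle and reduce phases, using word counting as the example. It does not use Apache Hadoop or Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this sketch. In a real cluster, the framework would distribute these phases across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory collection of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data batch", "big data streaming"]
    print(word_count(sample))
```

The same three-phase structure applies whether the input is a short list, as here, or a large file split into blocks on a distributed file system; only the execution environment changes.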
Head Lecturer(s)
Pedro Miguel Naia Neves
Assessment Methods
Assessment
Mini Tests: 30.0%
Project: 30.0%
Exam: 40.0%
Bibliography
- Articles, resources available on the Internet, and selected book chapters for each specialized topic.
1. Ramcharan Kakarla, Sundar Krishnan, Sridhar Alla, Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle, 2024
2. Gabriel Schenker, The Ultimate Docker Container Book: Build, Test, Ship, and Run Containers with Docker and Kubernetes, Packt Publishing, 3rd edition, Apress, 1st edition, 2020
3. Jonathan Rioux, Data Analysis with Python and PySpark, 1st edition, 2022
4. Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty, Apache Kafka: The Definitive Guide, O'Reilly, 2nd edition, 2021
5. Jules Damji, Denny Lee, Brooke Wenig, Learning Spark: Lightning-Fast Big Data Analytics, O'Reilly, 2nd edition, 2020
6. Jan Kunigk, Ian Buss, Paul Wilkinson, Lars George, Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale, O'Reilly, 1st edition, 2019