Advanced Infrastructures for Data Science

Year
2

Academic year
2020-2021

Code
02038700

Subject Area
Optional

Language of Instruction
Portuguese

Other Languages of Instruction
English

Mode of Delivery
Face-to-face

Duration
Semester

ECTS Credits
6.0

Type
Elective

Level
2nd Cycle Studies - Master's

Recommended Prerequisites

Good programming skills and knowledge of distributed systems and cloud computing are recommended. Fluency in English at level B2 (ideally C1) of the Common European Framework of Reference for Languages is also recommended.

Teaching Methods

Lecture classes (T): presentation and discussion around the topics of the course.

Lab classes (PL): application of theoretical concepts in projects.

Learning Outcomes

The course provides a theoretical and practical approach to managing high-performance IT services and infrastructures built from the ground up to support big data processing solutions, including the planning and administration of such infrastructures and the management of existing resources. The curriculum guides students toward skills ranging from the management of virtualisation clusters and data centers to the orchestration of micro-services, with a focus on supporting big data processing frameworks (such as Apache Hadoop and Spark).

In this course, students should acquire skills in understanding, analyzing and summarizing the subjects addressed, critical thinking, organization and planning, problem solving, group work, autonomous learning, and practical application of knowledge.

Work Placement(s)

Syllabus

1. Support infrastructures for Data Science: an introduction

2. Managing data center infrastructures: computing, storage and communications

3. Container orchestration systems (e.g. Kubernetes, Docker, Vagrant, Mesos)

4. Real-time big data architectures: Kappa and Lambda

5. Scalable and distributed transport (e.g. Apache Kafka)

6. Big data processing frameworks (e.g. Apache Hadoop and Spark)

7. The placement problem: optimizing data ingestion and processing on massively distributed architectures

8. Advanced cloud computing topics

9. Resource orchestration and management: planning for scalability

Assessment Methods

Assessment
Research work: 20.0%
Exam: 40.0%
Laboratory work or Field work: 40.0%

Bibliography

- Articles, resources available on the Internet, and selected book chapters for each specialized topic.

- Neha Narkhede, Gwen Shapira, and Todd Palino, Apache Kafka: The Definitive Guide (2017)

- Matei Zaharia, Patrick Wendell, Andy Konwinski, and Holden Karau, Learning Spark: Lightning-Fast Big Data Analysis (2015)

- Jan Kunigk, Ian Buss, Paul Wilkinson, and Lars George, Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale (2019)