# Statistics for Data Science

**Year**

2

**Academic year**

2023-2024

**Code**

01016614

**Subject Area**

Mathematics

**Language of Instruction**

Portuguese

**Mode of Delivery**

Face-to-face

**Duration**

Semester

**ECTS Credits**

6.0

**Type**

Compulsory

**Level**

1st Cycle Studies

## Recommended Prerequisites

Statistics; Numerical Linear Algebra and Scientific Computing

## Teaching Methods

The teaching methods combine conventional lectures, in which the topics are motivated and introduced with the support of slides, software, and illustrations (theoretical classes), with classes that demonstrate the concepts and their computational implementation (practical classes). Over the teaching period, students consolidate the learning outcomes through group projects in which they apply the methods and tools autonomously under the teacher's supervision.

## Learning Outcomes

Upon successful completion of this course, students should be prepared to conduct statistical analyses of large datasets, large both in the number of variables and in the number of observations collected. The course addresses the fundamental multivariate statistical concepts, tools, and methods required to describe, model, and make inferences in data-intensive contexts. The methods are organized into monoblock, double-block, and multi-block approaches, depending on the type of problem and the datasets available. Students should also be able to validate the models they develop and decide on the appropriate level of model complexity. They should also understand how the Frequentist and Bayesian approaches differ, as well as the difference between association and causality.

## Work Placement(s)

No

## Syllabus

Part I - Introduction

1. Review of matrix algebra for multivariate data analysis.

2. Multivariate probability distributions

3. The Bayesian and Frequentist perspectives. Inference and analysis.

4. Causality and association.

Part II - Inference and Modeling

5. Inference and hypothesis testing for multivariate and high-dimensional samples. The problem of excessive statistical power.

6. Monoblock modeling and analysis (X)

• Principal Components Analysis

• Independent Component Analysis

7. Double-block modeling and analysis (X -> Y)

• Canonical Correlation Analysis

• Partial Least Squares

• The problem of collinearity and sparsity

8. Multi-block modeling and analysis (X, Y, ...)

9. Probabilistic graphical models

10. Non-linear modeling. Kernelization.

Part III - Validation and Analysis

11. Evaluation of models and their selection. Bias-variance trade-off. Analysis of complexity.

12. Analysis of the quality of information generated in an empirical study (InfoQ).

## Head Lecturer(s)

Marco Paulo Seabra dos Reis

## Assessment Methods

Assessment

*Problem Solving: 10.0%*

*Project: 40.0%*

*Exam: 50.0%*

## Bibliography

Eriksson, L., Johansson, E., Kettaneh-Wold, N., & Wold, S. (2001). Multi- and Megavariate Data Analysis – Principles and Applications. Umeå (Sweden): Umetrics AB.

Johnson, R. A., & Wichern, D. W. (2018). Applied Multivariate Statistical Analysis (6th ed.). Upper Saddle River, NJ: Prentice Hall.

Hair, J. F., Jr., Anderson, R. E., Tatham, R. L., & Black, W. C. (2018). Multivariate Data Analysis (8th ed.). Upper Saddle River, NJ: Prentice-Hall.

Dillon, W. R., & Goldstein, M. (1984). Multivariate Analysis - Methods and Applications. New York: Wiley.

Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). New York: Springer.

Draper, N. R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). New York: Wiley.

Montgomery, D. C., & Runger, G. C. (1999). Applied Statistics and Probability for Engineers (2nd ed.). New York: Wiley.