Advanced Topics in Data Processing and Analysis

Year
1
Academic year
2014-2015
Code
03006337
Subject Area
Optional Specialties
Language of Instruction
Portuguese
Other Languages of Instruction
English
Mode of Delivery
Face-to-face
Duration
One semester
ECTS Credits
6.0
Type
Elective
Level
3rd Cycle Studies

Recommended Prerequisites

The student is expected to be reasonably comfortable with (though not necessarily an expert in) foundational topics such as linear algebra, probability and statistics, functional analysis, and programming.

Teaching Methods

Some classes are expository lectures on theory. Most classes consist of a paper presentation by one student, followed by a discussion among all the students enrolled in the course.

Learning Outcomes

The course covers advanced topics in evaluation criteria for data quality on the web and the Internet, processing engines for data warehousing and mining, information networks, graph mining, and large-scale data analysis.
At the end of the course, students will be able to design applications with challenging requirements involving:
–    Data quality concepts and business intelligence quality
–    Methods for evaluating and creating data quality
–    Architectures for data quality
–    Large graph data and robust ranking algorithms for information networks
–    Efficient algorithms for processing engines
–    High performance and scalability
These general goals give the student in-depth knowledge of the state of the art in data quality management, sensor data models, processing engines for warehousing and mining, large graph data processing, information network analysis, and the analysis of Big Data engines.

Work Placement(s)

No

Syllabus

Module I - Data Quality  
- Data Quality Processes and Techniques
- Web and Internet Information Quality
- Measuring Data Quality

Module II - Large Graph Data
- Real World Datasets (Internet, Web, Social Networks)
- Dimension Reduction (SVD, Random Projection, etc.)
- Graph Data Properties (Adjacency Matrix, Laplacian, etc.)
- Node Importance via Random Walks (PageRank Algorithm)

Module III - Graph Mining
- Machine Learning for Structured, Semi-structured and Stream Data
- Social Network Mining, Biological Networks
- Data Mining Challenges (Case Study: Computational Advertising)

Module IV - Data Processing Engines  
- Introduction to Large-Scale Data Analysis and Systems
- Large-Scale Data Engines
- Large-Scale Data Analysis

Assessment Methods

Assessment
Presentation and discussion: 20.0%
Paper reviews: 80.0%

Bibliography

1- C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques, Ch. 1, Springer, 2006
2- M. Scannapieco, P. Missier, C. Batini, Data Quality at a Glance, Datenbank-Spektrum, 2005
3- W. E. Winkler, Methods for Evaluating and Creating Data Quality, Information Systems, 29(7), 2004
4- A. N. Langville, C. D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006
5- M. E. J. Newman, The Structure and Function of Complex Networks, SIAM Review, 45(2): 167-256, 2003
6- J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney, Statistical Properties of Community Structure in Large Social and Information Networks, Proc. of the Int. Conf. on World Wide Web (WWW), 2008
7- S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30: 107-117, 1998
8- A. Pavlo, E. Paulson et al., A Comparison of Approaches to Large-Scale Data Analysis, Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 165-178, 2009