DECENTRALIZED SYSTEMS FOR BIG DATA MANAGEMENT AND DECISION MAKING
Σπύρος Σιούτας
The course aims to introduce students to the following two pillars: (1) Foundations of Advanced Decentralized Computing Systems; (2) Practical Overview of non-traditional software systems for big data management (with emphasis on Spark, Python, and PySpark).
In particular, it focuses on the following topics:
- Hashing, Bloom Filters, Internet Caching Protocols, Distributed Hash Tables.
- Decentralized Data Structures and P2P Systems, DHT-based Decentralized Systems (Chord).
- Blockchain and Decentralized Applications (DApps): Hashing Data in the Real World, Storing Transaction Data, Using the Data Store, Protecting the Data Store, Distributing the Data Store Among Peers, Verifying and Adding Transactions, Choosing a Transaction History.
- Distributed File Systems (HDFS), Map/Reduce Programming Framework and NoSQL Databases, Cluster Architecture, Data Flow Systems, Spark, RDDs.
- Overview of Python for big data management: Introduction to libraries and tools (pandas, NumPy, etc.), Introduction to PySpark, Understanding PySpark's architecture and components.
- Big Data Storage and Processing in Decentralized Systems: Batch processing (MapReduce, Spark), Stream processing (Spark Streaming, Flink).
- Large-Scale Machine Learning with PySpark:
  - Introduction to Machine Learning
  - Large-Scale Machine Learning
  - Introduction to MLlib: overview of MLlib, PySpark's machine learning library; algorithms supported by MLlib (regression, classification, clustering, etc.)
  - Distributed machine learning with PySpark
- Advanced Topics and Case Studies:
  - Future trends in decentralized systems for big data: IoT and Cloud, containers, Docker, Kubernetes
  - Practical: mini-project combining concepts from previous lectures
SYLLABUS:
Week #1: Hashing, Bloom Filters, Internet Caching Protocols, Distributed Hash Tables.
Week #2: Decentralized Data Structures and P2P Systems, DHT-based Decentralized Systems (Chord).
Week #3: Blockchain and Decentralized Applications (DApps).
Week #4: HDFS, Map/Reduce Programming Framework and NoSQL Databases, Cluster Architecture, Data Flow Systems, Spark, RDDs.
Week #5: Overview of Python for data management in Decentralized Systems
(Practical Part: Basic data manipulation with Python and PySpark).
Week #6: Big Data Storage and Processing in Decentralized Systems.
(Practical Part: Batch processing with PySpark).
Week #7: Big Data Storage and Processing in Decentralized Systems (Cont’d).
(Practical Part: Batch processing with PySpark).
Week #8: Large-Scale Machine Learning with PySpark.
(Practical Part: Implementation of a simple machine learning model using Python's scikit-learn).
Week #9: Large-Scale Machine Learning with PySpark (Cont’d).
(Practical Part: Implementation of a simple machine learning model using Python's scikit-learn).
Week #10: Large-Scale Machine Learning with PySpark (Cont’d).
(Practical Part: Implementation of a machine learning model with PySpark's MLlib).
Week #11: Large-Scale Machine Learning with PySpark (Cont’d).
(Practical Part: Implementation of a machine learning model with PySpark's MLlib).
Week #12: Advanced Topics and Case Studies.
(Practical Part: Project combining concepts from previous lectures).
Week #13: Advanced Topics and Case Studies (Cont’d).
(Practical Part: Project combining concepts from previous lectures).
STUDENT PERFORMANCE EVALUATION:
Assignments (100%):
- Research Paper Presentation (30% - 50%)
- Project Implementation (50% - 70%)
BIBLIOGRAPHY
- Spark: The Definitive Guide, Bill Chambers, Matei Zaharia, O'Reilly Media, Inc., February 2018, ISBN: 9781491912218.
- PySpark Tutorial For Beginners (Spark with Python): https://sparkbyexamples.com/pyspark-tutorial/
- The Google File System: https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
- MapReduce: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
- Hadoop: https://hadoop.apache.org/
- Decentralized Frameworks for Future Power Systems, 1st Edition, Mohsen Parsa Moghaddam, Reza Zamani, Hassan Haes Alhelou, Pierluigi Siano, May 12, 2022, ISBN: 9780323916981.
- Blockchain Basics: A Non-Technical Introduction in 25 Steps, Daniel Drescher, 2017, ISBN-13 (electronic): 978-1-4842-2604-9, DOI: 10.1007/978-1-4842-2604-9. https://app.scnu.edu.cn/iscnu/learning/block_chain/Blockchain%20basic.pdf