Anteprima
ΥΔΑ102: Τεχνικές διαχείρισης και Εξόρυξης για Δεδομένα Μεγάλου Όγκου (Big Data Management and Mining Methods)
(DDCD102) - Χρήστος Μακρής
Descrizione del Corso
Το μάθημα απευθύνεται σε όσους φοιτητές θέλουν να αποκτήσουν βασικές γνώσεις στην περιοχή των δομών δεδομένων και των αλγοριθμικών τεχνικών που σχετίζονται με τη διαχείριση και επεξεργασία μεγάλου όγκου δεδομένων. Το μάθημα παρουσιάζει μία αρκετά ευρεία (σε σχέση με το πλήθος εφαρμογών και τις χρησιμοποιούμενες τεχνικές) περιοχή στον χώρο της Επιστήμης των Υπολογιστών, και σχετίζεται άμεσα με άλλα προπτυχιακά και μεταπτυχιακά μαθήματα του Προγράμματος Σπουδών όπως: Δομές Δεδομένων, Προχωρημένες Δομές Δεδομένων και Γραφική, Θεωρία Βασικών Δομών Δεδομένων, Αλγόριθμοι, Τεχνολογίες Υλοποίησης Αλγορίθμων. Ψηφιακή Επεξεργασία και Ανάλυση Εικόνας, Ανάκτηση Πληροφορίας, Βάσεις Δεδομένων I, II.
Ουσιαστικά το μάθημα είναι συνέχεια του μ/χ μαθήματος του ΕΤΥ Μέθοδοι και Τεχνολογίες για Διαχείριση Μεγάλου Όγκου Δεδομένων σε συνδυασμό με βασικές έννοιες από το μ/χ μάθημα του ΕΤΥ Θεωρία Βασικών Δομών Δεδομένων, και με επιπλέον διδασκαλία εννοιών από Μηχανική Μάθηση και Εξόρυξη Γνώσης.
Τα θέματα που καλύπτονται στο μάθημα είναι τα ακόλουθα:
- Μοντέλα Υπολογισμού και μετρικές χρονική και χωρικής πολυπλοκότητας
- Διαχρονικές Δομές Δεδομένων, Αλγόριθμοι και Δομές Δεδομένων σε προβλήματα διαχείρισης δενδρικών γραφημάτων
- Ουρές Προτεραιότητας και Αυτοοργανώμενα Δέντρα
- Αλγόριθμοι Διάταξης και Δομές Ψαξίματος στο RAM Μοντέλο Υπολογισμού,
- Μοντέλα Δευτερεύουσας Μνήμης (Μοντέλο δύο επιπέδων, ιεραρχικά μοντέλα, Cache Oblivious μοντέλα)
- Αλγόριθμοι και Δομές Δεδομένων για μοντέλα δύο επιπέδων (B-Trees, R-trees, Αλγόριθμοι Υπολογιστικής Γεωμετρίας στη Δευτερεύουσα Μνήμη, Αλγόριθμοι Διαχείρισης Συμβολοσειρών στη Δευτερεύουσα Μνήμη)
- Αλγόριθμοι και Δομές Δεδομένων για Cache Oblivious Μοντέλα (Cache Oblivious B-Tree, Cache Oblivious Ουρές Προτεραιότητας, Γεωμετρικοί Αλγόριθμοι στο Cache Oblivious Μοντέλο Υπολογισμού)
- Aλγόριθμοι στο Μοντέλο Ροής Δεδομένων
- Εφαρμογές (Xρονικές Βάσεις Δεδομένων, Χωροχρονικές Βάσεις Δεδομένων, Πολυμεσικές Βάσεις Δεδομένων)
- Αλγόριθμοι Εξόρυξης Γνώσης (Ομαδοποίηση, Κανόνες Συσχέτισης και Κατηγοριοποίηση)
Παρατίθεται σχετική βιβλιογραφία που θα χρησιμοποιηθεί στα πλαίσια του μαθήματος:
- J. Vitter, Algorithms and Data Structures for External Memory, Book, (http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf)
- Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2020 (second edition)
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 2014 (http://www.mmds.org/)
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Sch?tze, Introduction to Information Retrieval, Cambridge University Press. 2008.
- Gonzallo Navarro, Compact data structures, a practical approach, 2016 https://www.cambridge.org/core/books/compact-data-structures/68A5983E6F1176181291E235D0B7EB44
- Gerth Stølting Brodal, John Iacono, Markus E. Nebel, Vijaya Ramachandran: Scalable Data Structures (Dagstuhl Seminar 21071) https://drops.dagstuhl.de/opus/volltexte/2021/14348/
- Brodal, Gerth Stølting ; Meyer, Ulrich Carsten ; Nebel, Bernhard E. ; Sedgewick, Robert
Weitere Beteiligte (Hrsg. etc.): Gerth Stølting Brodal and Ulrich Carsten Meyer and Markus E. Nebel and Robert Sedgewick
Data Structures for the Cloud and External Memory Data (Dagstuhl Seminar 19051) https://drops.dagstuhl.de/opus/volltexte/2019/10572/ - Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein,
Introduction to Algorithms, Third Edition, MIT Press, 2009 - Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning (https://www.deeplearningbook.org/)
- Tony Hey Stewart Tansley Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery
Published by Microsoft Research | October 2009, https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf - Erik Demaine, Cache Oblivious Algorithms and Data-Structures, in Lecture Notes from the EEF Summer School on Massive Data Sets, Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27-July 1, 2002, (http://theory.lcs.mit.edu/~edemaine/papers/)
- Lars Arge, External-Memory Geometric Data Structures. In EFF Summer School on Massive Data Sets, 2004 (https://imada.sdu.dk/~rolf/Edu/DM808/F08/Handouts/ionotes.pdf)
- Lars Arge, External-Memory Data Structures. In Handbook of Massive Data Sets, 2002 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.3293)
- T.Y. Liu, Learning to rank for information retrieval, Springer 2011, https://link.springer.com/book/10.1007/978-3-642-14267-3
- Hanah Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann 2006, https://www.elsevier.com/books/foundations-of-multidimensional-and-metric-data-structures/samet/978-0-12-369446-1
Creation Date
martedì 29 gennaio 2019
-
Μέθοδοι αξιολόγησης
Εξετάσεις
Η εξέταση του μαθήματος συνίσταται σε: (i) προφορική εξέταση πάνω στο περιεχόμενο επιλεγμένων βιβλίων που θα καλύπτουν τα όσα έχουν παρουσιαστεί στο μάθημα, και τα οποία αναφέρονται στη συνέχεια, (ii) σε ωριαία παρουσίαση και γραπτή αναφορά στο περιεχόμενο μίας επιστημονικής δημοσίευσης. H γραπτή αναφορά θα είναι 18-25 σελίδες με τις βασικές ιδέες του paper, καθώς και ο,τι συμπληρωματικό υλικό στο πρόβλημα που πραγματεύεται το paper έχετε συλλέξει.
Ο τελικός βαθμός θα είναι το ημιάθροισμα των δύο βαθμολογιών.
Ύλη Εξέτασης
Η βασική ύλη εξέτασης είναι το βιβλίο του Vitter, Algorithms and Data Structures for External Memory, Book, (http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf) KAI χρησιμοποιείται υλικό από τα βιβλία των
- Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2020 (second edition)
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 2014 (http://www.mmds.org/)
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Sch?tze, Introduction to Information Retrieval, Cambridge University Press. 2008.
- Pretrained Transformers for Text Ranking: BERT and Beyond by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates ( University of Waterloo, University of Campinas, University of Amsterdam) Morgan & Claypool (Synthesis Lectures on Human Language Technologies, edited by Graeme Hirst, volume 53), 2021
(στη λίστα παρατίθεται και βοηθητικό βιβλιογραφικό υλικό):
- J. Vitter, Algorithms and Data Structures for External Memory, Book, (http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf)
- Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2020 (second edition)
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 2014 (http://www.mmds.org/)
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Sch?tze, Introduction to Information Retrieval, Cambridge University Press. 2008.
- Gonzallo Navarro, Compact data structures, a practical approach, 2016 https://www.cambridge.org/core/books/compact-data-structures/68A5983E6F1176181291E235D0B7EB44
- Gerth Stølting Brodal, John Iacono, Markus E. Nebel, Vijaya Ramachandran: Scalable Data Structures (Dagstuhl Seminar 21071) https://drops.dagstuhl.de/opus/volltexte/2021/14348/
- Brodal, Gerth Stølting ; Meyer, Ulrich Carsten ; Nebel, Bernhard E. ; Sedgewick, Robert
Weitere Beteiligte (Hrsg. etc.): Gerth Stølting Brodal and Ulrich Carsten Meyer and Markus E. Nebel and Robert Sedgewick
Data Structures for the Cloud and External Memory Data (Dagstuhl Seminar 19051) https://drops.dagstuhl.de/opus/volltexte/2019/10572/ - Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein,
Introduction to Algorithms, Third Edition, MIT Press, 2009 - Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning (https://www.deeplearningbook.org/)
- Tony Hey Stewart Tansley Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery
Published by Microsoft Research | October 2009, https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf - Erik Demaine, Cache Oblivious Algorithms and Data-Structures, in Lecture Notes from the EEF Summer School on Massive Data Sets, Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27-July 1, 2002, (http://theory.lcs.mit.edu/~edemaine/papers/)
- Lars Arge, External-Memory Geometric Data Structures. In EFF Summer School on Massive Data Sets, 2004 (https://imada.sdu.dk/~rolf/Edu/DM808/F08/Handouts/ionotes.pdf)
- Lars Arge, External-Memory Data Structures. In Handbook of Massive Data Sets, 2002 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.3293)
- T.Y. Liu, Learning to rank for information retrieval, Springer 2011, https://link.springer.com/book/10.1007/978-3-642-14267-3
- Hanah Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann 2006, https://www.elsevier.com/books/foundations-of-multidimensional-and-metric-data-structures/samet/978-0-12-369446-1
ΠΙΟ ΣΥΓΚΕΚΡΙΜΕΝΑ
η ύλη της προφορικής εξέτασης για το μάθημα είναι:
1. τα κεφάλαια 1, 2, 3, 4 ,5, 8, 10,11,12, 13, 14, 15. από το βιβλίο J. Vitter, Algorithms and Data Structures for External Memory, Book, (http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf) και από αυτά έμφαση στα όσα ειπώθηκαν στις διαλέξεις και υπάρχουν στις διαφάνειες του e_class
2. επιλεγμένο υλικό από τα βιβλία:
- Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2020 (second edition) https://dataminingbook.info/ κεφ. 8, 10, 13, 14, 16, 17, 22, 25
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 2014 (http://www.mmds.org/) κεφ. 6, 7,10
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Sch?tze, Introduction to Information Retrieval, Cambridge University Press. 2008. και από αυτά έμφαση στα όσα ειπώθηκαν στις διαλέξεις και υπάρχουν στις διαφάνειες του e_class κεφ. 6.7
- Pretrained Transformers for Text Ranking: BERT and Beyond by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates ( University of Waterloo, University of Campinas, University of Amsterdam) Morgan & Claypool (Synthesis Lectures on Human Language Technologies, edited by Graeme Hirst, volume 53), 2021
ΕΝΔΕΙΚΤΙΚΕΣ ΔΗΜΟΣΙΕΥΣΕΙΣ ΠΑΝΩ ΣΤΙΣ ΟΠΟΙΕΣ ΘΑ ΒΑΣΙΣΤΕΙΤΕ ΓΙΑ ΤΗΝ ΕΡΓΑΣΙΑ ΣΑΣ.
Θα πρέπει να μου στείλετε email από 1.3 μέχρι 10.3 με 3 επιλογές εκ των οποίων μόνο μία θα είναι σε paper κυριας μνήμης,
External Memory Data Structures
- Vitter, J. S. and Shriver, E. 1994a. Algorithms for parallel memory I: Two-level memories. Algorithmica 12, 2-3, 110-147.
- Vitter, J. S. and Shriver, E. A.1994b. Algorithms for parallel memory II: Hierarchical multilevel.
- S. Lanka, E. Mays, Fully Persistent B+-trees, ACM International Conference on Management of Data, 426-435, 1992.
- Varman P. & Verma R. An Efficient Multiversion Access Structure. IEEE Transactions on Knowledge and Data Engineering, 391-409. 1997.
- B. Becker, S. Gschwind, T. Ohler, B. Seeger, P. Widmayer, An asymptotically optimal multiversion B-tree, The VLDB Journal, 264-275.
- P. Ferragina and R. Grossi. The string B-tree: a new data structure for string search in external memory and its applications. J. ACM, 46(2):236-280, 1999.
- Lars Arge, Mikkel Thorup: RAM-Efficient External Memory Sorting. CoRR abs/1312.2018 (2013) 2012
- Pankaj K. Agarwal, Lars Arge, Sathish Govindarajan, Jun Yang, Ke Yi, Efficient external memory structures for range-aggregate queries. Comput. Geom. 46(3): 358-370 (2013)
- Lars Arge, Michael T. Goodrich, Freek van Walderveen, Computing betweenness centrality in external memory. BigData Conference 2013: 368-375
- Lars Arge, Johannes Fischer, Peter Sanders, Nodari Sitchinava, On (Dynamic) Range Minimum Queries in External Memory. WADS 2013: 37-48R
- Rahul Shah, Cheng Sheng, Sharma V. Thankachan, Jeffrey Scott Vitter, Top-k Document Retrieval in External Memory. ESA 2013: 803-81
- Gerth St?lting Brodal, Konstantinos Tsakalidis, Spyros Sioutas, Kostas Tsichlas:
Fully persistent B-trees. SODA 2012: 602-614 - Michael A. Bender, Martin Farach-Colton
, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, Helen Xu:
Write-Optimized Skip Lists. PODS 2017: 69-78 - Michael A. Bender, Alex Conway, Martin Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, Prashant Pandey, Donald E. Porter, Jun Yuan, Yang Zhan: External-memory Dictionaries in the Affine and PDAM Models. ACM Trans. Parallel Comput. 8(3): 15:1-15:20 (2021)
- Michael A. Bender, Roozbeh Ebrahimi, Haodong Hu, Bradley C. Kuszmaul:
B-Trees and Cache-Oblivious B-Trees with Different-Sized Atomic Keys. ACM Trans. Database Syst. 41(3): 19:1-19:33 (2016) - Michael A. Bender, Jonathan W. Berry, Rob Johnson, Thomas M. Kroeger, Samuel McCauley, Cynthia A. Phillips, Bertrand Simon, Shikha Singh, David Zage:
Anti-Persistence on Persistent Storage: History-Independent Sparse Tables and Dictionaries. PODS 2016: 289-302 - Gerth Stølting Brodal
, Rolf Fagerberg, David Hammer, Ulrich Meyer, Manuel Penschuck, Hung Tran:
An Experimental Study of External Memory Algorithms for Connected Components. SEA 2021: 23:1-23:23 - Alex Conway, Abhishek Gupta, Vijay Chidambaran, Martin Farach-Colton§ Rick Spillane¶ Amy Tai, Rob Johnson,
SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores, 2020 USENIX Annual Technical Conference - Alex Conway, Martín Farach-Colton , Philip Shilane Optimal Hashing in External Memory, ICALP 2018
- Peter Sanders, Connecting MapReduce Computations to Realistic Machine Models, https://arxiv.org/pdf/2002.07553.pdf
- Athanasios Fevgas
, Leonidas Akritidis, Miltiadis Alamaniotis, Panagiota E. Tsompanopoulou, Panayiotis Bozanis:
HyR-tree: a spatial index for hybrid flash/3D XPoint storage. Neural Comput. Appl. 35(1): 133-145 (2023) - Athanasios Fevgas, Leonidas Akritidis
, Panayiotis Bozanis
, Yannis Manolopoulos:
Indexing in flash storage devices: a survey on challenges, current approaches, and future trends. VLDB J. 29(1): 273-311 (2020) -
Chen Luo, Michael J. Carey, LSM-based storage techniques: a survey, The VLDB Journal 19 July 2019
-
Xin Zhang, Qizhong Mao, Ahmed Eldawy, Vagelis Hristidis, Yihan Sun: Bi-directional Log-Structured Merge Tree. SSDBM 2022: 19:1-19:4
-
Qizhong Mao, Mohiuddin Abdul Qader, Vagelis Hristidis: Comprehensive Comparison of LSM Architectures for Spatial Data. IEEE BigData 2020: 455-460
-
S.-W. Jun, S. Xu et al., "Terabyte sort on fpga-accelerated flash storage", Field-Programmable Custom Computing Machines (FCCM) 2017 IEEE 25th Annual International Symposium on, pp. 17-24, 2017.
-
Wu, Chin-Hsien, and Kuo-Yi Huang. “Data Sorting in Flash Memory.” ACM Transactions on Storage, no. 2, Association for Computing Machinery (ACM), Mar. 2015, pp. 1–25. Crossref, doi:10.1145/2665067.
-
Jackson, Riley, et al. “Efficient External Sorting for Memory-Constrained Embedded Devices with Flash Memory.” ACM Transactions on Embedded Computing Systems, no. 4, Association for Computing Machinery (ACM), Mar. 2021, pp. 1–21. Crossref, doi:10.1145/3446976.
-
A. Laga, J. Boukhobza, F. Singhoff and M. Koskas, "Montres: merge on-the-run external sorting algorithm for large data volumes on ssd based storage systems", IEEE Transactions on Computers, vol. 66, no. 10, pp. 1689-1702, 2017.
-
Yubiao Chen, Jianzhong Li, Hong Gao: FSSort: External Sort For Solid State Drives. CCGRID 2021: 71-80
- Nadir Ould-Khessal, Scott Fazackerley, and Ramon Lawrence, An Efficient B-tree Implementation for Memory-Constrained Embedded Systems, https://arxiv.org/abs/2302.07800, 2023
-
Bender, M.A., Farach-Colton, M., Jannen, W., Johnson, R., Kuszmaul, B.C., Porter, D.E., Yuan, J., Zhan, Y.: An Introduction to B-trees and Write-
Optimization. Usenix Mag. 40(5) (2015) -
Fazackerley, S., Ould-Khessal, N., Lawrence, R.: Efficient flash indexing for time series data on memory-constrained embedded sensor devices. In: Proceedings of
the 10th International Conference on Sensor Networks, SENSORNETS 2021, pp. 92{99. SCITEPRESS (2021). DOI 10.5220/0010318800920099 -
Gerth Stølting Brodal, Casper Moldrup Rysgaard, Rolf Svenning: External Memory Fully Persistent Search Trees. STOC 2023: 1410-1423
-
Gerth Stølting Brodal, Casper Moldrup Rysgaard, Jens Kristian Refsgaard Schou, Rolf Svenning: Space-Efficient Functional Offline-Partially-Persistent Trees with Applications to Planar Point Location. WADS 2023: 644-659
External Memory Geometric Data Structures
- Arge, L. 1995a. The buffer tree: A new technique for optimal I/O-algorithms. In Proceedings of the Workshop on Algorithms and Data Structures, Vol. 955 of Lecture Notes in Computer Science, Springer-Verlag, 334-345, 1995. A complete version appears as BRICS Technical Report RS-96-28, University of Aarhus.
- Arge, L. 1997. External-memory algorithms with applications in geographic information systems. In M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, eds, Algorithmic Foundations of GIS, Vol. 1340 of Lecture Notes in Computer Science, Springer-Verlag, 213-254.
- Arge, L. and Vahrenhold, J. I/O-efficient dynamic planar point location. In Proceedings of the ACM Symposium on Computational Geometry (June), Vol. 9, 191-200, 2000.
- Kanellakis, P. C.,Ramaswamy, S., Vengroff, D. E., and Vitter, J. S. 1996. Indexing for data models with constraints and classes. Journal of Computer and System Sciences 52, 3, 589-612.
- Arge, L. and Vitter, J. S. Optimal dynamic interval management in external memory. In Proceedings of the IEEE Symposium on Foundations of Computer Science, 1996.
- Arge, L., Samoladas, V., and Vitter, J. S. 1999b Two-dimensional indexability and optimal range search indexing. In Proceedings of the ACMConference Principles of Database Systems (Philadelphia, May-June), Vol. 18, 346-357, 1999
- Lars Arge, Gerth St?lting Brodal, S. Srinivasa Rao: External Memory Planar Point Location with Logarithmic Updates. Algorithmica 63(1-2): 457-475 (2012)
- Lars Arge, Vasilis Samoladas, Ke Yi: Optimal External Memory Planar Point Enclosure. Algorithmica 54(3): 337-352 (2009)
- Peyman Afshani, Lars Arge, Kasper Dalgaard Larsen: Orthogonal Range Reporting in Three and Higher Dimensions. FOCS 2009: 149-158
- Peyman Afshani, Nodari Sitchinava: I/O-Efficient Range Minima Queries. SWAT 2014: 1-12
- Gerth Stølting Brodal
, Rolf Fagerberg, David Hammer, Ulrich Meyer, Manuel Penschuck, Hung Tran:
An Experimental Study of External Memory Algorithms for Connected Components. SEA 2021: 23:1-23:23
Foundations of Multidimensional and Metric Data Structures (ενότητες από το βιβλίο του Hanan Samet -http://www.cs.umd.edu/~hjs/mkbook/content-page.pdf )
- Mutlidimensional Point Data (Quadtrees, k-d trees, Grid Directory, PK-trees)
- Object based and image based image representations .
- Intervals and small rectangles
- High Dimensional Data
BIG DATA MANAGEMENT AND MINING
1. Michael Nelson, Sridhar Radhakrishnan, Chandra N. Sekharan: Queryable Compression on Time-Evolving Social Networks with Streaming. IEEE International Conference on Big Data 2018, 146-15
2. Sungchul Lee, Ju-Yeon Jo, Yoohwan Kim: Key based Deep Data Locality on Hadoop. IEEE International Conference on Big Data 2018,3889-3898
3. Jacob D. Moorman, Qinyi Chen, Thomas K. Tu, Zachary M. Boyd, Andrea L. Bertozzi: Filtering Methods for Subgraph Matching on Multiplex Networks. IEEE International Conference on Big Data 2018,3980-3985
4. Nils Gruschka, Vasileios Mavroeidis, Kamer Vishi, Meiko Jensen: Privacy Issues and Data Protection in Big Data: A Case Study Analysis under GDPR. IEEE International Conference on Big Data 2018,5027-5033
5. Alfredo Cuzzocrea, Fabio Martinelli, Francesco Mercaldo: Improving Machine Learning Tools with Embeddings: Applications to Big Data Security. IEEE International Conference on Big Data 2018,5086-5092
6. Xian Wu, Baoxu Shi, Yuxiao Dong, Chao Huang, Nitesh V. Chawla: Neural Tensor Factorization for Temporal Interaction Learning. WSDM 2019, 537-545
7.Manzil Zaheer, Amr Ahmed, Yuan Wang, Daniel Silva, Marc Najork, Yuchen Wu, Shibani Sanan, Surojit Chatterjee: Uncovering Hidden Structure in Sequence Data via Threading Recurrent Models. WSDM 2019, 186-194
8. Branislav Kveton, S. Muthukrishnan, Hoa T. Vu, Yikun Xian: Finding Subcube Heavy Hitters in Analytics Data Streams. 1705-1714, WWW 2018
9. Xin Luo, Ye Wu, Xin-Shun Xu: Scalable Supervised Discrete Hashing for Large-Scale Search. 1603-1612, WWW 2018
10. Rajesh Jayaram, Gokarna Sharma, Srikanta Tirthapura, David P. Woodruff: Weighted Reservoir Sampling from Distributed Streams. 218-235, ACM PODS 2019
11. Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich: In-Database Learning with Sparse Tensors. 325-340 ACM PODS 2018
12. Rajesh Jayaram, David P. Woodruff: Data Streams with Bounded Deletions. 341-354, ACM PODS 2018
13. Jiaqi Gu, Yugo H. Watanabe, William A. Mazza, Alexander Shkapsky, Mohan Yang, Ling Ding, Carlo Zaniolo: RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark. 467-484, ACM SIGMOD 2019
14. Andrew Carter, Andrew Rodriguez, Yiming Yang, Scott Meyer: Nanosecond Indexing of Graph Data With Hash Maps and VLists. 623-635, ACM SIGMOD 2019
15. Daniel Kocher, Nikolaus Augsten: A Scalable Index for Top-k Subtree Similarity Queries. 1624-1641 ACM SIGMOD 2019
16. Hwanjun Song, Jae-Gil Lee: RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning. 1173-1187, ACM SIGMOD 2018
17. Jia Zou, R. Matthew Barnett, Tania Lorido-Botran, Shangyu Luo, Carlos Monroy, Sourav Sikdar, Kia Teymourian, Binhang Yuan, Chris Jermaine:
PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development. 1189-1204 ACM SIGMOD 2018
18. Maaz Bin Safeer Ahmad, Alvin Cheung: Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications. 1205-1220, ACM SIGMOD 201819. Alex Conway, Abhishek Gupta, Vijay Chidambaran, Martin Farach-Colton§ Rick Spillane¶ Amy Tai, Rob Johnson,
SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores, 2020 USENIX Annual Technical Conference20. Saman Ashkiani, Martin Farach-Colton, John D. Owens, A Dynamic Hash Table for the GPU, 2018,
https://arxiv.org/pdf/1710.11246.pdf21. Muhammad A. Awad, Saman Ashkiani, Serban D. Porumbescu, Better GPU Hash Tables, 2021, https://arxiv.org/pdf/2108.07232.pdf
22. Peter Sanders, Connecting MapReduce Computations to Realistic Machine Models, https://arxiv.org/pdf/2002.07553.pdf
23. Lars Gottesbüren, Tobias Heuer, Nikolai Maas, Peter Sanders, Sebastian Schlag: Scalable High-Quality Hypergraph Partitioning. ACM Trans. Algorithms 20(1): 9:1-9:54 (2024)
COMPRESSED DATA STRUCTURES
1. H?ctor Ferrada, Gonzalo Navarro: Lempel-Ziv compressed structures for document retrieval. Inf. Comput. 265: 1-25 (2019)
2. Gonzalo Navarro: Document listing on repetitive collections with guaranteed performance. Theor. Comput. Sci. 772: 58-72 (2019)
3. Alberto Ord??ez Pereira, Gonzalo Navarro, Nieves R. Brisaboa: Grammar compressed sequences with rank/select support. J. Discrete Algorithms 43: 54-71 (2017)
4. Gonzalo Navarro, Yakov Nekrich: Time-Optimal Top-k Document Retrieval. SIAM J. Comput. 46(1): 80-113 (2017)
5. Miguel A. Mart?nez-Prieto, Nieves R. Brisaboa, Rodrigo C?novas, Francisco Claude, Gonzalo Navarro: Practical compressed string dictionaries. Inf. Syst. 56: 73-108 (2016)
6. Cecilia Hern?ndez, Gonzalo Navarro: Compressed representations for web and social graphs. Knowl. Inf. Syst. 40(2): 279-313 (2014)
7. Gonzalo Navarro, Nora Reyes: New dynamic metric indices for secondary memory. Inf. Syst. 59: 48-78 (2016)
8. Francisco Claude, Gonzalo Navarro: Self-Indexed Grammar-Based Compression. Fundam. Inform. 111(3): 313-337 (2011)9, Tejalal Choudhary, Vipul Mishra, Anurag Goswami, Jagannathan Sarangapani, A comprehensive survey on model compression and acceleration
Artifcial Intelligence Review (2020) 53:5113–5155 https://doi.org/10.1007/s10462-020-09816-710. Giosuè Cataldo Marinò, Alessandro Petrini, Dario Malchiodi, Marco Frasca Università degli Studi di Milano, Compact representations of convolutional neural networks via weight pruning and quantization arXiv e-prints
11. Deep neural networks compression: A comparative survey and choice recommendations. Neurocomputing, 2022.
12. Learned Sorted Table Search and Static Indexes in Small-Space Data Models. Data, 2023.
13. Guillermo de Bernardo, Travis Gagie, Susana Ladra, Gonzalo Navarro, Diego Seco: Faster compressed quadtrees. J. Comput. Syst. Sci. 131: 86-104 (2023)
Cache Oblivious Data Structures
- M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. 40th IEEE Symp. on Foundations of Computer Science (FOCS 99), pages 285-297, 1999.
- M. A. Bender, E. Demaine, and M. Farach-Colton."Cache Oblivious B-Trees" Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 399-409, 2000.
- Gerth Stψlting Brodal and Rolf Fagerberg, Funnel Heap - A Cache Oblivious Priority Queue, In Proc. 13th Annual International Symposium on Algorithms and Computation, volume 2518 of Lecture Notes in Computer Science, pages 219-228. Springer Verlag, Berlin, 2002.
- Gianni Franceschini, Roberto Grossi, J. Ian Munro, and Linda Pagli. Implicit B-trees: A New Data Structure for the Dictionary Problem. Journal of Computer and System Sciences, special issue of the 43th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2004.
- Gianni Franceschini and Roberto Grossi. Implicit dictionaries supporting searches and amortized updates in O(log n loglog n). In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 670-678. SIAM, 2003.
- Gianni Franceschini and Roberto Grossi. Implicit dictionaries supporting searches and amortized updates in O(log n loglog n). In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 670-678. SIAM, 2003.
- Arash Farzan, Paolo Ferragina, Gianni Franceschini, J. Ian Munro:Cache-Oblivious Comparison-Based Algorithms on Multisets. ESA 2005: 305-316
- Gerth St?lting Brodal, Erik D. Demaine, Jeremy T. Fineman, John Iacono, Stefan Langerman, J. Ian Munro:
Cache-Oblivious Dynamic Dictionaries with Update/Query Tradeoffs. SODA 2010: 1448-1456 - Gerth St?lting Brodal, Casper Kejlberg-Rasmussen: Cache-Oblivious Implicit Predecessor Dictionaries with the Working-Set Property. STACS 2012: 112-123
- Lars Arge, Gerth St?lting Brodal, Jakob Truelsen, Constantinos Tsirogiannis: An Optimal and Practical Cache-Oblivious Algorithm for Computing Multiresolution Rasters. ESA 2013: 61-72
- Peyman Afshani, Chris H. Hamilton, Norbert Zeh:
A general approach for cache-oblivious range reporting and approximate range counting. Comput. Geom. 43(8): 700-712 (2010) - Michael A. Bender, Rezaul Alam Chowdhury, Rathish Das, Rob Johnson, William Kuszmaul, Andrea Lincoln, Quanquan C. Liu, Jayson Lynch, Helen Xu:
Closing the Gap Between Cache-oblivious and Cache-adaptive Analysis. SPAA 2020: 63-73 - Michael A. Bender, Martin Farach-Colton
, Sándor P. Fekete
, Jeremy T. Fineman, Seth Gilbert:
Cost-Oblivious Storage Reallocation. ACM Trans. Algorithms 13(3): 38:1-38:20 (2017) - Gerth Stølting Brodal, Sebastian Wild: Funnelselect: Cache-Oblivious Multiple Selection. ESA 2023: 25:1-25:1
Data structures in the Cloud
- Wang, J., S. Wu, H. Gao, J. Li, and B. C. Ooi (2010). Indexing multi-dimensional data n a cloud system. In Proceedings of the 2010 international conference on Management of data, SIGMOD '10, New York, NY, USA, pp. 591-602. ACM.
- 2. Grossman, R. and Y. Gu (2008), Data mining using high performance data clouds: experimental studies using sector and sphere. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '08, NewYork, NY, USA, pp. 920-927. ACM.
- Sai Wu, Dawei Jiang, Beng Chin Ooi, KunLung Wu, Efficient Btree Based Indexing for Cloud Data Processing, Proceedings of VLDB Endowment,1207-1218
Learned Index data Structures
1. Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis: The Case for Learned Index Structures. SIGMOD Conference 2018: 489-504
2. Christos Anagnostopoulos, Peter Triantafillou:Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality. TKDD 11(4): 47:1-47:46 (2017)
3.Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska: Partitioned Learned Bloom Filters. ICLR 2021
4. Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska:
Benchmarking Learned Indexes. Proc. VLDB Endow. 14(1): 1-13 (2020)5, Vikram Nathan, Jialin Ding, Mohammad Alizadeh, Tim Kraska: Learning Multi-Dimensional Indexes. SIGMOD Conference 2020: 985-1000
LSM Trees and SSDs
Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis: Enabling Timely and Persistent Deletion in LSM-Engines. ACM Trans. Database Syst. 48(3): 8:1-8:40 (2023)
Subhadeep Sarkar, Niv Dayan, Manos Athanassoulis: The LSM Design Space and its Read Optimizations. ICDE 2023: 3578-3584
Subhadeep Sarkar, Manos Athanassoulis: Dissecting, Designing, and Optimizing LSM-based Data Stores. SIGMOD Conference 2022: 2489-2497
Tarikul Islam Papon, Manos Athanassoulis: The Need for a New I/O Model. CIDR 2021
Zichen Zhu, Ju Hyoung Mun, Aneesh Raman, Manos Athanassoulis: Reducing Bloom Filter CPU Overhead in LSM-Trees on Modern Storage Devices. DaMoN 2021: 1:1-1:10
Tarikul Islam Papon, Manos Athanassoulis: A Parametric I/O Model for Modern Storage Devices. DaMoN 2021: 2:1-2:11
Athanasios Fevgas, Leonidas Akritidis
, Panayiotis Bozanis
, Yannis Manolopoulos: Indexing in flash storage devices: a survey on challenges, current approaches, and future trends. VLDB J. 29(1): 273-311 (2020)Pretrained transformers for Information Retrieval Ranking
1. Antonio Mallia, Joel Mackenzie, Torsten Suel, Nicola Tonellotto: Faster Learned Sparse Retrieval with Guided Traversal. SIGIR 2022: 1901-1905
2. Antonio Mallia, Omar Khattab, Torsten Suel, Nicola Tonellotto: Learning Passage Impacts for Inverted Indexes. SIGIR 2021: 1723-1727
3. Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, Matei Zaharia, ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction,
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies4. Karan Samel, Cheng Li, Weize Kong,Tao Chen, Mingyang Zhang, Shaleen Gupta, Swaraj Khadanga, Wensong Xu, Xingyu Wang, Kashyap Kolipaka, Michael Bendersky, Marc Najork,
End-to-End Query TermWeighting, KDD 2023Data Structures for Tree and String Manipulation Search Trees and Priority Queues/sorting (POINTER MACHINE AND RAM ALGORITHMS)
- Richard Cole, Ramesh Hariharan: Tree Pattern Matching to Subset Matching in Linear Time. SIAM J. Comput. 32(4): 1056-1066 (2003)
- Erik D. Demaine, Gad M. Landau, Oren Weimann: On Cartesian Trees and Range Minimum Queries. Algorithmica 68(3): 610-625 (2014)
- Francisco Claude, Gonzalo Navarro, Alberto Ord??ez Pereira: The wavelet matrix: An efficient wavelet tree for large alphabets. Inf. Syst. 47: 15-32 (2015)
- J?r?my Barbay, Francisco Claude, Travis Gagie, Gonzalo Navarro, Yakov Nekrich: Efficient Fully-Compressed Sequence Representations. Algorithmica 69(1): 232-268 (2014)
- Gonzalo Navarro, Yakov Nekrich: Optimal Dynamic Sequence Representations. SIAM J. Comput. 43(5): 1781-1806 (2014) /
- Gonzalo Navarro, Sharma V. Thankachan: Encodings for Range Majority Queries. CPM 2014: 262-272
- Gonzalo Navarro, Sharma V. Thankachan: Optimal Encodings for Range Majority Queries. CoRR abs/1404.2677 (2014)
- Travis Gagie, Juha K?rkk?inen, Gonzalo Navarro, Simon J. Puglisi: Colored range queries and document retrieval. Theor. Comput. Sci. 483: 36-50 (2013)
- Roberto Konow, Gonzalo Navarro, Charles L. A. Clarke, Alejandro L?pez-Ortiz: Faster and smaller inverted indices with treaps. SIGIR 2013: 193-202
- G. S. Brodal. Finger Search Trees with Constant Insertion Time. In Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 540-549, 1998.
- Gerth St?lting Brodal, George Lagogiannis, Christos Makris, Athanasios K. Tsakalidis, Kostas Tsichlas: Optimal finger search trees in the pointer machine. J. Comput. Syst. Sci. 67(2): 381-418 (2003
- Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. Journal of the ACM , 32(3):652-686, July 1985.
- Gerth St?lting Brodal, George Lagogiannis, Robert Endre Tarjan: Strict fibonacci heaps. STOC 2012: 1177-1184
- Haim Kaplan, Robert Endre Tarjan, Uri Zwick: Soft Heaps Simplified. SIAM J. Comput. 42(4): 1660-1673 (201
- Arne Andersson, Mikkel Thorup: Dynamic ordered sets with exponential search trees. J. ACM 54(3): 13 (2007)
- Yijie Han: Deterministic sorting in O(nloglogn) time and linear space. J. Algorithms 50(1): 96-105 (2004)
- Mikkel Thorup
, Or Zamir, Uri Zwick: Dynamic Ordered Sets with Approximate Queries, Approximate Heaps and Soft Heaps. ICALP 2019: 95:1-95:13
Βιβλιογραφία
ΒΙΒΛΙΑ
- J. Vitter, Algorithms and Data Structures for External Memory, Book, (http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf)
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 2014 (http://www.mmds.org/)
- Tony Hey Stewart Tansley Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery
Published by Microsoft Research | October 2009 - Gonzallo Navarro, Compact data structures, a practical approach 2016. https://www.cambridge.org/core/books/compact-data-structures/68A5983E6F1176181291E235D0B7EB44
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein,
Introduction to Algorithms, Third Edition, MIT Press, 2009 - Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning (https://www.deeplearningbook.org/)
- Erik Demaine, Cache Oblivious Algorithms and Data-Structures, in Lecture Notes from the EEF Summer School on Massive Data Sets, Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27-July 1, 2002, (http://theory.lcs.mit.edu/~edemaine/papers/)
- Lars Arge, External-Memory Geometric Data Structures. In EFF Summer School on Massive Data Sets, 2004 (https://imada.sdu.dk/~rolf/Edu/DM808/F08/Handouts/ionotes.pdf)
- Lars Arge, External-Memory Data Structures. In Handbook of Massive Data Sets, 2002 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.3293)
- T.Y. Liu, Learning to rank for information retrieval, Springer 2011, https://link.springer.com/book/10.1007/978-3-642-14267-3
- Hanah Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann 2006, https://www.elsevier.com/books/foundations-of-multidimensional-and-metric-data-structures/samet/978-0-12-369446-1
ΣΥΝΔΕΣΜΟΙ
Σύνδεσμοι
Προτεινόμενα συγγράμματα
To βασικό σύγγραμμα είναι το βιβλίο του Vitter KAI των Jure Leskovec, Anand Rajaraman, Jeff Ullman (στη λίστα παρατίθεται και βοηθητικό βιβλιογραφικό υλικό):
- J. Vitter, Algorithms and Data Structures for External Memory, Book, (http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf)
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 2014 (http://www.mmds.org/)
- Tony Hey Stewart Tansley Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery
Published by Microsoft Research | October 2009 - Gonzallo Navarro, Compact data structures, a practical approach 2016 https://www.cambridge.org/core/books/compact-data-structures/68A5983E6F1176181291E235D0B7EB44
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein,
Introduction to Algorithms, Third Edition, MIT Press, 2009 - Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning (https://www.deeplearningbook.org/)
- Erik Demaine, Cache Oblivious Algorithms and Data-Structures, in Lecture Notes from the EEF Summer School on Massive Data Sets, Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27-July 1, 2002, (http://theory.lcs.mit.edu/~edemaine/papers/)
- Lars Arge, External-Memory Geometric Data Structures. In EFF Summer School on Massive Data Sets, 2004 (https://imada.sdu.dk/~rolf/Edu/DM808/F08/Handouts/ionotes.pdf)
- Lars Arge, External-Memory Data Structures. In Handbook of Massive Data Sets, 2002 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.3293)
- T.Y. Liu, Learning to rank for information retrieval, Springer 2011, https://link.springer.com/book/10.1007/978-3-642-14267-3
- Hanah Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann 2006, https://www.elsevier.com/books/foundations-of-multidimensional-and-metric-data-structures/samet/978-0-12-369446-1
- Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning (https://www.deeplearningbook.org/)