2025
@article{ROCKENBACH:GSParLib:CSI:25,
title = {GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism},
author = {Dinei A. Rockenbach and Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1016/j.csi.2024.103922},
doi = {10.1016/j.csi.2024.103922},
year = {2025},
date = {2025-03-01},
urldate = {2025-03-01},
journal = {Computer Standards \& Interfaces},
volume = {92},
pages = {103922},
publisher = {Elsevier},
abstract = {The evolution of Graphics Processing Units (GPUs) has allowed the industry to overcome long-lasting problems and challenges. Many belong to the stream processing domain, whose central aspect is continuously receiving and processing data from streaming data producers such as cameras and sensors. Nonetheless, programming GPUs is challenging because it requires deep knowledge of many-core programming, mechanisms, and optimizations for GPUs. Current GPU programming standards do not target stream processing and present programmability and code portability limitations. Our main scientific contributions are: GSParLib, a C++ multi-level programming interface unifying CUDA and OpenCL for GPU processing on stream and data parallelism with negligible performance losses compared to manual implementations, organized in two layers: one for general-purpose computing and another for high-level structured programming based on parallel patterns; a methodology to provide unified and driver-agnostic interfaces minimizing performance losses; a set of parallelism strategies and optimizations for GPU processing targeting stream and data parallelism; and new experiments covering GPU performance on applications exposing stream and data parallelism.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
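GSParLib itself is a C++ interface; purely as a language-neutral illustration of the driver-agnostic, two-layer idea the abstract describes (a pattern layer delegating to interchangeable low-level backends), here is a minimal Python sketch. All names (`Map`, `CudaLikeBackend`, `OpenCLLikeBackend`) are hypothetical and are not GSParLib's API.

```python
# Hypothetical sketch: a "Map" parallel pattern whose execution is delegated
# to whichever backend is selected at runtime, mirroring how a unified
# CUDA/OpenCL interface can hide the underlying driver from the programmer.
class Backend:
    def run(self, kernel, data):
        raise NotImplementedError

class CudaLikeBackend(Backend):      # stand-in for a CUDA driver wrapper
    def run(self, kernel, data):
        return [kernel(x) for x in data]

class OpenCLLikeBackend(Backend):    # stand-in for an OpenCL driver wrapper
    def run(self, kernel, data):
        return [kernel(x) for x in data]

class Map:
    """Pattern layer: users express *what* to compute, not *how* it runs."""
    def __init__(self, kernel, backend: Backend):
        self.kernel, self.backend = kernel, backend

    def __call__(self, data):
        return self.backend.run(self.kernel, data)

square = Map(lambda x: x * x, CudaLikeBackend())
print(square([1, 2, 3]))  # [1, 4, 9]
```

Swapping `CudaLikeBackend()` for `OpenCLLikeBackend()` changes nothing in user code, which is the portability property the paper's layered design targets.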
2024
@article{HOFFMANN:single-board-computers:FGCS:24,
title = {Benchmarking parallel programming for single-board computers},
author = {Renato B. Hoffmann and Dalvan Griebler and Rodrigo Rosa Righi and Luiz G. Fernandes},
url = {https://doi.org/10.1016/j.future.2024.07.003},
doi = {10.1016/j.future.2024.07.003},
year = {2024},
date = {2024-12-01},
urldate = {2024-12-01},
journal = {Future Generation Computer Systems},
volume = {161},
pages = {119--134},
publisher = {Elsevier},
abstract = {Within the computing continuum, SBCs (single-board computers) are essential in the Edge and Fog, with many featuring multiple processing cores and GPU accelerators. In this way, parallel computing plays a crucial role in enabling the full computational potential of SBCs. However, selecting the best-suited solution in this context is inherently complex due to the intricate interplay between PPI (parallel programming interface) strategies, SBC architectural characteristics, and application characteristics and constraints. To our knowledge, no solution presents a combined discussion of these three aspects. To tackle this problem, this article aims to provide a benchmark of the best-suited parallelism PPIs given a set of hardware and application characteristics and requirements. Compared to existing benchmarks, we introduce new metrics, additional applications, various parallelism interfaces, and extra hardware devices. Therefore, our contributions are the methodology to benchmark parallelism on SBCs and the characterization of the best-performing parallelism PPIs and strategies for given situations. We are confident that parallel computing will be mainstream to process edge and fog computing; thus, our solution provides the first insights regarding what kind of application and parallel programming interface is the most suited for a particular SBC hardware.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
@article{VOGEL:Supercomputing:24,
title = {Enhancing self-adaptation for efficient decision-making at run-time in streaming applications on multicores},
author = {Adriano Vogel and Marco Danelutto and Massimo Torquati and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1007/s11227-024-06191-w},
doi = {10.1007/s11227-024-06191-w},
year = {2024},
date = {2024-10-01},
urldate = {2024-10-01},
journal = {The Journal of Supercomputing},
volume = {80},
number = {15},
pages = {22213--22244},
publisher = {Springer},
abstract = {Parallel computing is very important to accelerate the performance of computing applications. Moreover, parallel applications are expected to continue executing in more dynamic environments and react to changing conditions. In this context, applying self-adaptation is a potential solution to achieve a higher level of autonomic abstractions and runtime responsiveness. In our research, we aim to explore and assess the possible abstractions attainable through the transparent management of parallel executions by self-adaptation. Our primary objectives are to expand the adaptation space to better reflect real-world applications and assess the potential for self-adaptation to enhance efficiency. We provide the following scientific contributions: (I) A conceptual framework to improve the designing of self-adaptation; (II) A new decision-making strategy for applications with multiple parallel stages; (III) A comprehensive evaluation of the proposed decision-making strategy compared to the state-of-the-art. The results demonstrate that the proposed conceptual framework can help design and implement self-adaptive strategies that are more modular and reusable. The proposed decision-making strategy provides significant gains in accuracy compared to the state-of-the-art, increasing the parallel applications' performance and efficiency.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
@inproceedings{GUDER:WEBMEDIA:24,
title = {Dimensional Speech Emotion Recognition: a Bimodal Approach},
author = {Larissa Guder and João Paulo Aires and Dalvan Griebler},
url = {https://doi.org/10.5753/webmedia_estendido.2024.244402},
doi = {10.5753/webmedia_estendido.2024.244402},
year = {2024},
date = {2024-10-01},
booktitle = {Anais Estendidos do XXX Simpósio Brasileiro de Sistemas Multimídia e Web},
pages = {5--6},
publisher = {SBC},
address = {Juiz de Fora, Brasil},
abstract = {Considering the human-machine relationship, affective computing aims to allow computers to recognize or express emotions. Speech Emotion Recognition is a task from affective computing that aims to recognize emotions in an audio utterance. The most common way to predict emotions from the speech is using pre-determined classes in the offline mode. In that way, emotion recognition is restricted to the number of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance, which can represent emotions with higher granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have tried to improve results on speech emotion recognition to predict emotion dimensions, they do not consider real-world scenarios, where processing the input in a short time is necessary. Considering these aspects, this work provides the first step towards creating a bimodal approach for Dimensional Speech Emotion Recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech emotion recognition. We evaluate different methods for creating audio and text representations, as well as automatic speech recognition techniques. Our best results achieve a CCC of 0.5915 for arousal, 0.4165 for valence, and 0.5899 for dominance in the IEMOCAP dataset.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{FAE:SBLP:24,
title = {An internal domain-specific language for expressing linear pipelines: a proof-of-concept with MPI in Rust},
author = {Leonardo Faé and Dalvan Griebler},
url = {https://doi.org/10.5753/sblp.2024.3691},
doi = {10.5753/sblp.2024.3691},
year = {2024},
date = {2024-09-01},
booktitle = {Anais do XXVIII Simpósio Brasileiro de Linguagens de Programação},
pages = {81--90},
publisher = {SBC},
address = {Curitiba/PR},
series = {SBLP'24},
abstract = {Parallel computation is necessary in order to process massive volumes of data in a timely manner. There are many parallel programming interfaces and environments, each with their own idiosyncrasies. This, alongside non-deterministic errors, make parallel programs notoriously challenging to write. Great effort has been put forth to make parallel programming for several environments easier. In this work, we propose a DSL for Rust, using the language’s source-to-source transformation facilities, that allows for automatic code generation for distributed environments that support the Message Passing Interface (MPI). Our DSL simplifies MPI’s quirks, allowing the programmer to focus almost exclusively on the computation at hand. Performance experiments show nearly or no runtime difference between our abstraction and manually written MPI code while resulting in less than half the lines of code. More elaborate code complexity metrics (Halstead) estimate from 4.5 to 14.7 times lower effort for expressing parallelism.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
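The DSL above is implemented with Rust's source-to-source transformation facilities; purely as an illustration of the "linear pipeline" programming model it abstracts (the programmer writes only per-stage computations, not the MPI plumbing), a Python sketch with hypothetical names:

```python
# Hypothetical sketch: a linear pipeline as a chain of per-stage functions.
# In the paper's setting each stage would map to an MPI rank exchanging
# messages; here stages are composed sequentially to show only the
# programming model the DSL exposes.
def pipeline(*stages):
    def run(item):
        for stage in stages:
            item = stage(item)
        return item
    return run

# The programmer expresses only the computation of each stage:
p = pipeline(lambda x: x + 1, lambda x: x * 2, str)
print(p(3))  # prints 8
```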
@inproceedings{LOFF:Euro-Par:24,
title = {MPR: An MPI Framework for Distributed Self-adaptive Stream Processing},
author = {Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes and Walter Binder},
url = {https://doi.org/10.1007/978-3-031-69583-4_28},
doi = {10.1007/978-3-031-69583-4_28},
year = {2024},
date = {2024-08-01},
booktitle = {Euro-Par 2024: Parallel Processing},
pages = {400--414},
publisher = {Springer},
address = {Madrid, Spain},
series = {Euro-Par'24},
abstract = {Stream processing systems must often cope with workloads varying in content, format, size, and input rate. The high variability and unpredictability make statically fine-tuning them very challenging. Our work addresses this limitation by providing a new framework and runtime system to simplify implementing and assessing new self-adaptive algorithms and optimizations. We implement a prototype on top of MPI called MPR and show its functionality. We focus on horizontal scaling by supporting the addition and removal of processes during execution time. Experiments reveal that MPR can achieve performance similar to that of a handwritten static MPI application. We also assess MPR's adaptation capabilities, showing that it can readily re-configure itself, with the help of a self-adaptive algorithm, in response to workload variations.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
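MPR's actual mechanism relies on MPI process management; as a language-neutral sketch of the horizontal-scaling decision it automates (add workers under queue pressure, retire them when the queue drains), with all thresholds hypothetical and not taken from the paper:

```python
def rescale(workers, queue_len, high=100, low=10, min_w=1, max_w=64):
    """Decide a new worker count from input-queue pressure (illustrative only)."""
    if queue_len > high and workers < max_w:
        return workers + 1   # spawn one more worker process
    if queue_len < low and workers > min_w:
        return workers - 1   # retire an idle worker process
    return workers           # otherwise keep the current degree of parallelism

print(rescale(4, 250))  # 5
print(rescale(4, 3))    # 3
```

A self-adaptive algorithm plugged into a framework like MPR would call such a decision function periodically and apply the result by adding or removing processes at run time.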
@inproceedings{GOMES:COMPSAC:24,
title = {Multiview Machine Learning Classification of Tooth Extraction in Orthodontics Using Intraoral Scans},
author = {Carlos Falcao Azevedo Gomes and Adriel Silva Araujo and Sunna Imtiaz Ahmad and Mauricio Cecilio Magnaguagno and Vinicius Crisosthemos Teixeira and Anushri Singh Rajapuri and Quinn Roederer and Dalvan Griebler and Vinicius Dutra and Hakan Turkkahraman and Marcio Sarroglia Pinho},
url = {https://doi.org/10.1109/COMPSAC61105.2024.00316},
doi = {10.1109/COMPSAC61105.2024.00316},
year = {2024},
date = {2024-07-01},
booktitle = {2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)},
pages = {1977--1982},
publisher = {IEEE},
address = {Osaka, Japan},
abstract = {Orthodontic treatment planning often involves deciding whether to extract teeth, a critical and irreversible decision. Integrating machine learning (ML) can enhance decision-making. This study proposes using Intraoral Scans (IOS) 3D models to predict extraction/non-extraction binary decisions with ML models. We leverage a multiview approach, using images taken from multiple points of view of the 3D model. The methodology involved a dataset composed of preprocessed IOS from 181 subjects and an experimental procedure that evaluated multiple ML models in their ability to classify subjects using either grayscale pixel intensities or radiomic features. The results indicated that a logistic model applied to the radiomic features from the back and frontal views of the 3D models was one of the best model candidates, achieving a test accuracy of 70% and F1 scores of 0.73 and 0.65 for non-extraction and extraction cases, respectively. Overall, these findings indicate that a multiview approach to IOS 3D models can be used to predict extraction/non-extraction decisions. In addition, the results suggest that radiomic features provide useful information in the analysis of IOS data.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{GUDER:SBCAS:24,
title = {Dimensional Speech Emotion Recognition from Bimodal Features},
author = {Larissa Guder and João Paulo Aires and Felipe Meneguzzi and Dalvan Griebler},
url = {https://doi.org/10.5753/sbcas.2024.2779},
doi = {10.5753/sbcas.2024.2779},
year = {2024},
date = {2024-07-01},
booktitle = {Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde},
pages = {579--590},
publisher = {SBC},
address = {Goiânia, Brasil},
abstract = {Considering the human-machine relationship, affective computing aims to allow computers to recognize or express emotions. Speech Emotion Recognition is a task from affective computing that aims to recognize emotions in an audio utterance. The most common way to predict emotions from the speech is using pre-determined classes in the offline mode. In that way, emotion recognition is restricted to the number of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance to represent emotions with higher granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have tried to improve results on speech emotion recognition to predict emotion dimensions, they do not consider real-world scenarios where processing the input quickly is necessary. Considering these aspects, we take the first step towards creating a bimodal approach for dimensional speech emotion recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech emotion recognition. Our final architecture achieves a Concordance Correlation Coefficient of 0.5915 for arousal, 0.1431 for valence, and 0.5899 for dominance in the IEMOCAP dataset.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
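The Concordance Correlation Coefficient (CCC) reported in the two speech-emotion entries above is Lin's standard agreement metric between predicted and reference dimension values. A minimal reference implementation (a sketch, not the authors' code):

```python
def ccc(x, y):
    """Lin's Concordance Correlation Coefficient between two sequences:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

print(ccc([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (perfect agreement)
```

Unlike plain Pearson correlation, CCC also penalizes shifts in mean and scale, which is why it is the usual choice for valence/arousal/dominance regression.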
@inproceedings{LEONARCZYK:Euro-ParW:23,
title = {Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing},
author = {Ricardo Leonarczyk and Dalvan Griebler and Gabriele Mencagli and Marco Danelutto},
url = {https://doi.org/10.1007/978-3-031-50684-0_7},
doi = {10.1007/978-3-031-50684-0_7},
year = {2024},
date = {2024-04-01},
booktitle = {Euro-Par 2023: Parallel Processing Workshops},
pages = {81--92},
publisher = {Springer},
address = {Limassol, Cyprus},
series = {Euro-ParW'23},
abstract = {Stream processing plays a vital role in applications that require continuous, low-latency data processing. Thanks to their extensive parallel processing capabilities and relatively low cost, GPUs are well-suited to scenarios where such applications require substantial computational resources. Micro-batching then becomes essential for efficient GPU computation within stream processing systems. However, finding appropriate batch sizes to maintain an adequate level of service is often challenging, particularly when applications experience fluctuations in input rate and workload. Addressing this challenge requires adjusting the batch size at runtime. This study proposes a methodology for evaluating different self-adaptive micro-batching strategies in a real-world complex streaming application used as a benchmark.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
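As a generic illustration of runtime batch-size adaptation (not one of the specific strategies evaluated in the paper, whose details the abstract does not give), a simple feedback controller that grows the micro-batch while latency is under a target and shrinks it otherwise:

```python
def adapt_batch(batch, latency_ms, target_ms, min_batch=1, max_batch=4096):
    """One step of a simple additive-increase / multiplicative-decrease
    controller: grow the micro-batch while latency stays under the target,
    halve it when the target is exceeded. All constants are illustrative."""
    if latency_ms > target_ms:
        return max(min_batch, batch // 2)   # back off: latency too high
    return min(max_batch, batch + 8)        # probe: room for a larger batch

b = 64
b = adapt_batch(b, latency_ms=5.0, target_ms=10.0)   # under target -> 72
b = adapt_batch(b, latency_ms=20.0, target_ms=10.0)  # over target  -> 36
print(b)
```

Self-adaptive strategies of this shape trade GPU throughput (large batches) against per-item latency (small batches), which is exactly the tension the paper's evaluation methodology probes under fluctuating input rates.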
@inproceedings{BIANCHESSI:ERAD:24,
title = {Analisando Paralelismo de Dados em Rust Usando o Método do Gradiente Conjugado},
author = {Lucas S. Bianchessi and Leonardo G. Faé and Renato B. Hoffmann and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2024.238677},
doi = {10.5753/eradrs.2024.238677},
year = {2024},
date = {2024-04-01},
booktitle = {Anais da XXIV Escola Regional de Alto Desempenho da Região Sul},
pages = {9--12},
publisher = {Sociedade Brasileira de Computação},
address = {Florianópolis, Brazil},
abstract = {In the high-performance computing landscape, the Rust language has become increasingly popular, promising safety, performance, and a modern development environment. To assess Rust's viability and efficiency, we used the conjugate gradient method from the NPB benchmarks. The results showed parallel performance comparable to C++, with a performance loss in the sequential version.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{HOFFMANN:ERAD:24,
title = {Em Direção à Programação Distribuída na Seleção de Planos em Sistemas Multi-Agentes},
author = {Renato B. Hoffmann},
url = {https://doi.org/10.5753/eradrs.2024.238734},
doi = {10.5753/eradrs.2024.238734},
year = {2024},
date = {2024-04-01},
booktitle = {Anais da XXIV Escola Regional de Alto Desempenho da Região Sul},
pages = {121--122},
publisher = {Sociedade Brasileira de Computação},
address = {Florianópolis, Brazil},
abstract = {Multi-agent systems are composed of multiple autonomous agents that interact with an environment and with each other to achieve specific goals. Jason, a popular multi-agent language, models these systems through beliefs, goals, and plans, with plan selection currently performed via a linear scan. This research therefore proposes to investigate plan selection performed in parallel and in a distributed system.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{ALF:ERAD:24,
title = {Tolerância a Falhas para Paralelismo de Stream de Alto Nível},
author = {Lucas M. Alf and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2024.238679},
doi = {10.5753/eradrs.2024.238679},
year = {2024},
date = {2024-04-01},
booktitle = {Anais da XXIV Escola Regional de Alto Desempenho da Região Sul},
pages = {119-120},
publisher = {Sociedade Brasileira de Computação},
address = {Florianópolis, Brazil},
abstract = {Given that stream processing systems need to run for long periods of time, possibly indefinitely, reprocessing all the data in the event of a failure can be highly costly or even infeasible. In this research, we propose to investigate how to provide fault-tolerance mechanisms and consistency guarantees for high-level distributed stream parallelism.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{FAE:ERAD:24,
title = {Proposta de Pipelines Lineares de Alto Nível em Rust Utilizando GPU},
author = {Leonardo G. Faé and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2024.238565},
doi = {10.5753/eradrs.2024.238565},
year = {2024},
date = {2024-04-01},
booktitle = {Anais da XXIV Escola Regional de Alto Desempenho da Região Sul},
pages = {105-106},
publisher = {Sociedade Brasileira de Computação},
address = {Florianópolis, Brazil},
abstract = {Graphics Processing Units (GPUs) are hardware units designed to process massive amounts of data in parallel. Rust is a new low-level programming language focused on performance and safety. To date, there is little academic work on high-level GPU abstractions in Rust. We propose a possible abstraction, based on the pipeline pattern and implemented using procedural macros.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{ARAUJO:ERAD:24,
title = {Em direção a um modelo de programação paralela único para CPUs e GPUs em processamento de stream},
author = {Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.5753/eradrs.2024.238670},
doi = {10.5753/eradrs.2024.238670},
year = {2024},
date = {2024-04-01},
booktitle = {Anais da XXIV Escola Regional de Alto Desempenho da Região Sul},
pages = {103-104},
publisher = {Sociedade Brasileira de Computação},
address = {Florianópolis, Brazil},
abstract = {This work presents partial results of ongoing research that uses the SPar Domain-Specific Language (DSL) to prototype a single parallel programming model targeting CPUs and GPUs in stream processing. With the initial prototype, it is already possible to generate parallel code for CPUs and GPUs in stream processing.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{FIM:ERAD:24,
title = {Proposta de Paralelismo de Stream Multi-GPU em Multi-Cores},
author = {Gabriel Rustick Fim and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2024.238680},
doi = {10.5753/eradrs.2024.238680},
year = {2024},
date = {2024-04-01},
booktitle = {Anais da XXIV Escola Regional de Alto Desempenho da Região Sul},
pages = {101-102},
publisher = {Sociedade Brasileira de Computação},
address = {Florianópolis, Brazil},
abstract = {Given the need for faster processing times, the use of multi-accelerator environments has become increasingly prominent in the literature. Unfortunately, programming for these kinds of environments presents a series of challenges that make developing code targeting multi-GPUs demand greater programming effort. We propose to investigate how to use C++ annotations to simplify multi-GPU code generation without compromising performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@article{GARCIA:JS:24,
title = {Performance and programmability of GrPPI for parallel stream processing on multi-cores},
author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1007/s11227-024-05934-z},
doi = {10.1007/s11227-024-05934-z},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
journal = {The Journal of Supercomputing},
volume = {80},
number = {9},
pages = {12966-13000},
publisher = {Springer},
abstract = {The GrPPI library aims to simplify the burdensome task of parallel programming. It provides a unified, abstract, and generic layer while promising minimal overhead on performance. Although it supports stream parallelism, GrPPI lacks an evaluation regarding representative performance metrics for this domain, such as throughput and latency. This work evaluates GrPPI focused on parallel stream processing. We compare the throughput and latency performance, memory usage, and programmability of GrPPI against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks and benchmarks with handwritten parallel code using the same backends supported by GrPPI. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is often competitive with handwritten parallel code, the infeasibility of fine-tuning GrPPI is a crucial drawback for emerging applications. Despite this, programmability experiments estimate that GrPPI can potentially reduce the development time of parallel applications by about three times.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
@article{MENCAGLI:JPDC:24,
title = {General-purpose data stream processing on heterogeneous architectures with WindFlow},
author = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Alessandra Fais and Marco Danelutto},
url = {https://doi.org/10.1016/j.jpdc.2023.104782},
doi = {10.1016/j.jpdc.2023.104782},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
journal = {Journal of Parallel and Distributed Computing},
volume = {184},
pages = {104782},
publisher = {Elsevier},
abstract = {Many emerging applications analyze data streams by running graphs of communicating tasks called operators. To develop and deploy such applications, Stream Processing Systems (SPSs) like Apache Storm and Flink have been made available to researchers and practitioners. They exhibit imperative or declarative programming interfaces to develop operators running arbitrary algorithms working on structured or unstructured data streams. In this context, the interest in leveraging hardware acceleration with GPUs has become more pronounced in high-throughput use cases. Unfortunately, GPU acceleration has been studied for relational operators working on structured streams only, while non-relational operators have often been overlooked. This paper presents WindFlow, a library supporting the seamless GPU offloading of general partitioned-stateful operators, extending the range of operators that benefit from hardware acceleration. Its design provides high throughput while still exposing a higher-level API to users than the raw utilization of GPUs in Apache Flink.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
@article{FISCHER:IoT:24,
title = {Multi-Hospital Management: Combining Vital Signs IoT Data and the Elasticity Technique to Support Healthcare 4.0},
author = {Gabriel Souto Fischer and Gabriel Oliveira Ramos and Cristiano André Costa and Antonio Marcos Alberti and Dalvan Griebler and Dhananjay Singh and Rodrigo Rosa Righi},
url = {https://doi.org/10.3390/iot5020019},
doi = {10.3390/iot5020019},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {IoT},
volume = {5},
number = {2},
pages = {381-408},
publisher = {MDPI},
abstract = {Smart cities can improve the quality of life of citizens by optimizing the utilization of resources. In an IoT-connected environment, people's health can be constantly monitored, which can help identify medical problems before they become serious. However, overcrowded hospitals can lead to long waiting times for patients to receive treatment. The literature presents alternatives to address this problem by adjusting care capacity to demand. However, there is still a need for a solution that can adjust human resources in multiple healthcare settings, which is the reality of cities. This work introduces HealCity, a smart-city-focused model that can monitor patients’ use of healthcare settings and adapt the allocation of health professionals to meet their needs. HealCity uses vital signs (IoT) data in prediction techniques to anticipate when the demand for a given environment will exceed its capacity and suggests actions to allocate health professionals accordingly. Additionally, we introduce the concept of multilevel proactive human resources elasticity in smart cities, thus managing human resources at different levels of a smart city. An algorithm is also devised to automatically manage and identify the appropriate hospital for a possible future patient. Furthermore, some IoT deployment considerations are presented based on a hardware implementation for the proposed model. HealCity was evaluated with four hospital settings and obtained promising results: Compared to hospitals with rigid professional allocations, it reduced waiting time for care by up to 87.62%.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2023
@inproceedings{HOFFMANN:SBAC-PADW:23,
title = {Analyzing C++ Stream Parallelism in Shared-Memory when Porting to Flink and Storm},
author = {Renato Barreto Hoffmann and Leonardo Faé and Isabel Manssour and Dalvan Griebler},
url = {https://doi.org/10.1109/SBAC-PADW60351.2023.00017},
doi = {10.1109/SBAC-PADW60351.2023.00017},
year = {2023},
date = {2023-10-01},
booktitle = {International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)},
pages = {1-8},
publisher = {IEEE},
address = {Porto Alegre, Brazil},
series = {SBAC-PADW'23},
abstract = {Stream processing plays a crucial role in various information-oriented digital systems. Two popular frameworks for real-time data processing, Flink and Storm, provide solutions for effective parallel stream processing in Java. An option to leverage Java's mature ecosystem for distributed stream processing involves porting legacy C++ applications to Java. However, this raises considerations on the adequacy of the equivalent Java mechanisms and potential degradation in throughput. Therefore, our objective is to evaluate programmability and performance when converting stream processing applications from C++ to Java while also exploring the parallelization capabilities offered by Flink and Storm. Furthermore, we aim to assess the throughput of Flink and Storm on shared-memory manycore machines, a hardware architecture commonly found in cloud environments. To achieve this, we conduct experiments involving four different stream processing applications. We highlight challenges encountered when porting C++ to Java and working with Flink and Storm. Furthermore, we discuss throughput, latency, CPU, and memory usage results.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{ANDRADE:WSCAD:23,
title = {Extending the Planning Poker Method to Estimate the Development Effort of Parallel Applications},
author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
url = {https://doi.org/10.5753/wscad.2023.235925},
doi = {10.5753/wscad.2023.235925},
year = {2023},
date = {2023-10-01},
booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
pages = {181-192},
publisher = {SBC},
address = {Porto Alegre, Brasil},
abstract = {Since different Parallel Programming Interfaces (PPIs) are available to programmers, evaluating them to identify the most suitable PPI also became necessary. Recently, in addition to the performance of PPIs, developers’ productivity has also been evaluated by researchers in parallel processing. Some researchers conduct empirical studies involving people for productivity evaluation, which is time-consuming. Aiming to propose a less costly method for evaluating the development effort of parallel applications, we proposed modifying the Planning Poker method in this paper. We consider a representative set of parallel stream processing applications to evaluate the proposed modification. Our results showed that the proposed method required less effort for practical use than the controlled experiments with students.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{ALF:WSCAD:23,
title = {Análise da Execução de Algoritmos de Aprendizado de Máquina em Dispositivos Embarcados},
author = {Lucas Alf and Renato Barreto Hoffmann and Caetano Müller and Dalvan Griebler},
url = {https://doi.org/10.5753/wscad.2023.235915},
doi = {10.5753/wscad.2023.235915},
year = {2023},
date = {2023-10-01},
booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
pages = {61-72},
publisher = {SBC},
address = {Porto Alegre, Brasil},
abstract = {Advances in the IoT field motivate the use of machine learning algorithms on embedded devices. However, these algorithms demand a considerable amount of computational resources. The goal of this work was to analyze machine learning algorithms on embedded devices using CPU and GPU parallelism, in order to understand which hardware and software characteristics perform best with respect to energy consumption, inferences per second, and accuracy. Three Convolutional Neural Network models were evaluated, as well as traditional algorithms and classification and regression neural networks. The experiments showed that PyTorch achieved the best performance on the CNN models and on the classification and regression neural networks using the GPU, while Keras performed better when using only the CPU.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{BIANCHESSI:WSCAD:23,
title = {Conversão do NAS Parallel Benchmarks para C++ Standard},
author = {Arthur S. Bianchessi and Leonardo Mallmann and Renato Barreto Hoffmann and Dalvan Griebler},
url = {https://doi.org/10.5753/wscad.2023.235913},
doi = {10.5753/wscad.2023.235913},
year = {2023},
date = {2023-10-01},
booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
pages = {313-324},
publisher = {SBC},
address = {Porto Alegre, Brasil},
abstract = {The C++ language gained new parallelism abstractions with the definition of execution policies for the standard library algorithms. However, the suitability and performance of this alternative still need to be studied in comparison with other well-established alternatives. Therefore, the goal of this work was to explore the wide range of C++ standard library features to evaluate their applicability and performance on five NPB kernels. Through experiments in a multithreaded environment, it was found that incorporating standard library data structures, as well as the multidimensional access abstraction created, has no noticeable impact on execution time. The algorithms with parallel execution policies, however, showed a statistically significant performance loss.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{FAE:SBLP:23,
title = {Source-to-Source Code Transformation on Rust for High-Level Stream Parallelism},
author = {Leonardo Faé and Renato Barreto Hoffmann and Dalvan Griebler},
url = {https://doi.org/10.1145/3624309.3624320},
doi = {10.1145/3624309.3624320},
year = {2023},
date = {2023-09-01},
booktitle = {XXVII Brazilian Symposium on Programming Languages (SBLP)},
pages = {41-49},
publisher = {ACM},
address = {Campo Grande, Brazil},
series = {SBLP'23},
abstract = {Utilizing parallel systems to their full potential can be challenging for general-purpose developers. A solution to this problem is to create high-level abstractions using Domain-Specific Languages (DSL). We create a stream-processing DSL for Rust, a growing programming language focusing on performance and safety. To that end, we explore Rust’s macros as a high-level abstraction tool to support an existing DSL language named SPar and perform source-to-source code transformations in the abstract syntax tree. We aim to assess the Rust source-to-source code transformations toolset and its implications. We highlight that Rust macros are powerful tools for performing source-to-source code transformations for abstracting structured stream processing. In addition, execution time and programmability results are comparable to other solutions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
@inproceedings{FAE:ERAD:23,
title = {Benchmarking da Aplicação de Comparação de Similaridade entre Imagens com Flink, Storm e SPar},
author = {Leonardo Faé and Dalvan Griebler and Isabel Manssour},
url = {https://doi.org/10.5753/eradrs.2023.229258},
doi = {10.5753/eradrs.2023.229258},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {93-96},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {This work presents performance comparisons among the SPar, Apache Flink, and Apache Storm programming interfaces for the execution of an image-similarity comparison application. The results reveal that the SPar versions deliver superior performance when executed with a large number of threads, in terms of both latency and throughput (SPar achieves roughly 5 times higher throughput with 40 workers).},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Mallmann, Leonardo; Bianchessi, Arthur; Griebler, Dalvan Impacto da biblioteca padrão do C++ nos Kernels do NAS Parallel Benchmarks Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 89-92, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{MALLMANN:ERAD:23,
title = {Impacto da biblioteca padrão do C++ nos Kernels do NAS Parallel Benchmarks},
author = {Leonardo Mallmann and Arthur Bianchessi and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229236},
doi = {10.5753/eradrs.2023.229236},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {89-92},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {A programação paralela nativa na linguagem C++ ganhou força com os std algorithms e suas políticas de execução paralela. Para que seja possível a aplicação destes recursos, porém, é necessária a incorporação no código das estruturas de dados sobre as quais tais funções possam operar. Mesmo adicionando uma camada de abstração maior através de tais estruturas, observou-se um tempo de execução similar à versão em C.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Cunha, Lucas; Hoffmann, Renato; Griebler, Dalvan; Manssour, Isabel Avaliação do Paralelismo dos Kernels EP e CG em Sistemas Embarcados Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 57-60, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{CUNHA:ERAD:23,
title = {Avaliação do Paralelismo dos Kernels EP e CG em Sistemas Embarcados},
author = {Lucas Cunha and Renato Hoffmann and Dalvan Griebler and Isabel Manssour},
url = {https://doi.org/10.5753/eradrs.2023.229264},
doi = {10.5753/eradrs.2023.229264},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {57-60},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Neste artigo, testamos o ganho de desempenho obtido ao implementar códigos com processamento paralelo em sistemas embarcados genéricos. Para analisar o desempenho em relação ao speedup ideal, foram testados dois algoritmos paralelos (EP e CG) em dois sistemas embarcados diferentes. Os resultados mostram uma discrepância entre o melhor (3.98X) e o pior (1.38X) desempenho obtidos, indicando o tamanho do espectro de desempenho.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Müller, Caetano; Griebler, Dalvan Um estudo sobre uso do MPI para uma aplicação de detecção de picos em data streams Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 97-100, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{MULLER:ERAD:23,
title = {Um estudo sobre uso do MPI para uma aplicação de detecção de picos em data streams},
author = {Caetano Müller and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229253},
doi = {10.5753/eradrs.2023.229253},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {97-100},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Aplicações de data stream podem ser implementadas com diferentes interfaces de programação paralela. Neste artigo, realizou-se um estudo e uma implementação da aplicação Spike Detection com MPI, comparando-a com versões usando Flink, Storm e WindFlow. Avaliou-se o throughput e concluiu-se que a implementação com WindFlow apresenta o melhor desempenho, enquanto as versões com MPI tiveram um throughput inferior às demais soluções.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Bianchessi, Arthur; Mallmann, Leonardo; Griebler, Dalvan Avaliação do paralelismo nos kernels NAS Parallel Benchmarks usando estruturas de dados da biblioteca C++ Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 61-64, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{BIANCHESSI:ERAD:23,
title = {Avaliação do paralelismo nos kernels NAS Parallel Benchmarks usando estruturas de dados da biblioteca C++},
author = {Arthur Bianchessi and Leonardo Mallmann and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229266},
doi = {10.5753/eradrs.2023.229266},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {61-64},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {O conjunto de aplicações NAS Parallel Benchmarks (NPB) é projetado para avaliar a eficiência da paralelização em sistemas computacionais. Neste estudo, a versão NPB-CPP foi adaptada para utilizar a C++ Standard Library e seu desempenho foi avaliado. Os resultados apontaram para uma boa performance nos kernels EP, FT e CG. Entretanto, houve uma degradação de desempenho nos kernels MG e IS.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Zomer, Bernardo; Hoffmann, Renato; Griebler, Dalvan Implementação e Avaliação de Desempenho da Linguagem Rust no NAS Embarrassingly Parallel Benchmark Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 53-56, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{ZOMER:ERAD:23,
title = {Implementação e Avaliação de Desempenho da Linguagem Rust no NAS Embarrassingly Parallel Benchmark},
author = {Bernardo Zomer and Renato Hoffmann and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229261},
doi = {10.5753/eradrs.2023.229261},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {53-56},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Rust é uma linguagem multiparadigmática de alto desempenho que garante segurança de memória. NAS Parallel Benchmarks engloba aplicações paralelas de computação de dinâmica de fluidos, possuindo versões em Fortran e C++. Neste trabalho, a aplicação EP foi convertida para Rust e paralelizada com as bibliotecas Rayon e Rust SSP. Na avaliação de desempenho, Rust demonstrou a melhor escalabilidade no paralelismo quando foi usado Rust SSP.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Eichner, Eduardo; Andrade, Gabriella; Griebler, Dalvan; Fernandes, Luiz Gustavo Análise de Correlação no Esforço de Desenvolvimento de Aplicações Paralelas Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 49-52, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{EICHNER:ERAD:23,
title = {Análise de Correlação no Esforço de Desenvolvimento de Aplicações Paralelas},
author = {Eduardo Eichner and Gabriella Andrade and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.5753/eradrs.2023.229265},
doi = {10.5753/eradrs.2023.229265},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {49-52},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Neste trabalho, foram analisadas diversas métricas, voltadas para uma aplicação de processamento de vídeo, utilizando as interfaces FastFlow, TBB e SPar. Os resultados revelam que utilizando a SPar e o FastFlow é possível desenvolver uma aplicação paralela eficiente com menos esforço, ao contrário do TBB. Em trabalhos futuros planejamos incluir mais aplicações no dataset a fim de confirmar os resultados.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Gaspary, Pedro; Müller, Caetano; Griebler, Dalvan; Eizirik, Eduardo Avaliação do paralelismo em classificadores taxonômicos de sequências de rRNA usando Qiime2 Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 13-16, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{GASPARY:ERAD:23,
title = {Avaliação do paralelismo em classificadores taxonômicos de sequências de rRNA usando Qiime2},
author = {Pedro Gaspary and Caetano Müller and Dalvan Griebler and Eduardo Eizirik},
url = {https://doi.org/10.5753/eradrs.2023.229241},
doi = {10.5753/eradrs.2023.229241},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {13-16},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Classificação de sequências de rRNA é de suma importância para análise de microbiomas. Portanto, este trabalho avaliou o desempenho e eficiência do paralelismo de três algoritmos de classificação taxonômica do Qiime2. Entre eles, o VSearch apresentou a melhor eficiência na paralelização, mas também os maiores tempos de execução. Os outros dois, Naive-Bayes e Hybrid, apresentaram desempenho similar entre si, este sendo mais rápido até o quinto grau de paralelismo, e consumindo pouco menos memória que aquele.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Hoffmann, Renato Barreto; Griebler, Dalvan Avaliando Paralelismo em Dispositivos com Recursos Limitados Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 105-106, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{HOFFMANN:ERAD:23,
title = {Avaliando Paralelismo em Dispositivos com Recursos Limitados},
author = {Renato Barreto Hoffmann and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229269},
doi = {10.5753/eradrs.2023.229269},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {105-106},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Sistemas computacionais com recursos limitados dispõem de cada vez mais poder computacional. Portanto, para atender às restrições de desempenho e obter um baixo consumo de energia, é necessário utilizar o paralelismo. Este trabalho avaliou 4 aplicações em 3 dispositivos diferentes, comparando 5 interfaces de paralelismo.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Leonarczyk, Ricardo; Griebler, Dalvan Avaliação da Auto-Adaptação de Micro-Lote para aplicação de Processamento de Streaming em GPUs Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 123-124, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{LEONARCZYK:ERAD:23,
title = {Avaliação da Auto-Adaptação de Micro-Lote para aplicação de Processamento de Streaming em GPUs},
author = {Ricardo Leonarczyk and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229267},
doi = {10.5753/eradrs.2023.229267},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {123-124},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Este artigo apresenta uma avaliação de algoritmos para regular a latência através da auto-adaptação de micro-lote em sistemas de processamento de streaming acelerados por GPU. Os resultados demonstraram que o algoritmo com o fator de adaptação fixo conseguiu ficar por mais tempo na região de latência especificada para a aplicação.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Fim, Gabriel Rustick; Griebler, Dalvan Implementação e Avaliação do Paralelismo de Flink nas Aplicações de Processamento de Log e Análise de Cliques Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 69-72, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{larcc:FIM:ERAD:23,
title = {Implementação e Avaliação do Paralelismo de Flink nas Aplicações de Processamento de Log e Análise de Cliques},
author = {Gabriel Rustick Fim and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229290},
doi = {10.5753/eradrs.2023.229290},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {69-72},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {Este trabalho visou implementar e avaliar o desempenho das aplicações de Processamento de Log e Análise de Cliques no Apache Flink, comparando o desempenho com Apache Storm em um ambiente computacional distribuído. Os resultados mostram que a execução em Flink apresenta um consumo de recursos relativamente menor quando comparada a execução em Storm, mas possui um desvio padrão alto expondo um desbalanceamento de carga em execuções onde algum componente da aplicação é replicado.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Dopke, Luan; Griebler, Dalvan Estudo Sobre Spark nas Aplicações de Processamento de Log e Análise de Cliques Inproceedings doi In: Anais da XXIII Escola Regional de Alto Desempenho da Região Sul, pp. 85-88, Sociedade Brasileira de Computação, Porto Alegre, Brazil, 2023. @inproceedings{larcc:DOPKE:ERAD:23,
title = {Estudo Sobre Spark nas Aplicações de Processamento de Log e Análise de Cliques},
author = {Luan Dopke and Dalvan Griebler},
url = {https://doi.org/10.5753/eradrs.2023.229298},
doi = {10.5753/eradrs.2023.229298},
year = {2023},
date = {2023-05-01},
booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
pages = {85-88},
publisher = {Sociedade Brasileira de Computação},
address = {Porto Alegre, Brazil},
abstract = {O uso de aplicações de processamento de dados de fluxo contínuo vem crescendo cada vez mais. Dado este fato, o presente estudo visa mensurar o desempenho do framework Apache Spark Structured Streaming perante o framework Apache Storm nas aplicações de fluxo contínuo de dados, estas sendo processamento de logs e análise de cliques. Os resultados demonstram melhor desempenho para o Apache Storm em ambas as aplicações.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
 | Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo A parallel programming assessment for stream processing applications on multi-core systems Journal Article doi In: Computer Standards & Interfaces, vol. 84, pp. 103691, 2023. @article{ANDRADE:CSI:2023,
title = {A parallel programming assessment for stream processing applications on multi-core systems},
author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1016/j.csi.2022.103691},
doi = {10.1016/j.csi.2022.103691},
year = {2023},
date = {2023-03-01},
journal = {Computer Standards & Interfaces},
volume = {84},
pages = {103691},
publisher = {Elsevier},
abstract = {Multi-core systems are present in any computing device nowadays and stream processing applications are becoming recurrent workloads, demanding parallelism to achieve the desired quality of service. As soon as data, tasks, or requests arrive, they must be computed, analyzed, or processed. Since building such applications is not a trivial task, the software industry must adopt parallel APIs (Application Programming Interfaces) that simplify the exploitation of parallelism in hardware for accelerating time-to-market. In the last years, research efforts in academia and industry provided a set of parallel APIs, increasing productivity for software developers. However, few studies seek to prove the usability of these interfaces. In this work, we aim to present a parallel programming assessment regarding the usability of parallel APIs for expressing parallelism in the stream processing application domain on multi-core systems. To this end, we conducted an empirical study with beginners in parallel application development. The study covered three parallel APIs, reporting several quantitative and qualitative indicators involving developers. Our contribution also comprises a parallel programming assessment methodology, which can be replicated in future assessments. This study revealed important insights such as recurrent compile-time and programming logic errors performed by beginners in parallel programming, as well as the programming effort, challenges, and learning curve. Moreover, we collected the participants’ opinions about their experience in this study to understand the achieved results in depth.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
| Vogel, Adriano; Danelutto, Marco; Griebler, Dalvan; Fernandes, Luiz Gustavo Revisiting self-adaptation for efficient decision-making at run-time in parallel executions Inproceedings doi In: 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 43-50, IEEE, Naples, Italy, 2023. @inproceedings{VOGEL:PDP:23,
title = {Revisiting self-adaptation for efficient decision-making at run-time in parallel executions},
author = {Adriano Vogel and Marco Danelutto and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1109/PDP59025.2023.00015},
doi = {10.1109/PDP59025.2023.00015},
year = {2023},
date = {2023-03-01},
booktitle = {31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
pages = {43-50},
publisher = {IEEE},
address = {Naples, Italy},
series = {PDP'23},
abstract = {Self-adaptation is a potential alternative to provide a higher level of autonomic abstractions and run-time responsiveness in parallel executions. However, the recurrent problem is that self-adaptation is still limited in flexibility and efficiency. For instance, there is a lack of mechanisms to apply adaptation actions and efficient decision-making strategies to decide which configurations should be conveniently enforced at run-time. In this work, we are interested in providing and evaluating potential abstractions achievable with self-adaptation transparently managing parallel executions. Therefore, we provide a new mechanism to support self-adaptation in applications with multiple parallel stages executed in multi-cores. Moreover, we reproduce, reimplement, and evaluate an existing decision-making strategy in our scenario. The observations from the results show that the proposed mechanism for self-adaptation can provide new parallelism abstractions and autonomous responsiveness at run-time. On the other hand, there is a need for more accurate decision-making strategies to enable efficient executions of applications in resource-constrained scenarios like multi-cores.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; García, José Daniel; Muñoz, Javier Fernández; Fernandes, Luiz Gustavo A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores Inproceedings doi In: 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 164-168, IEEE, Naples, Italy, 2023. @inproceedings{GARCIA:PDP:23,
title = {A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores},
author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1109/PDP59025.2023.00033},
doi = {10.1109/PDP59025.2023.00033},
year = {2023},
date = {2023-03-01},
booktitle = {31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
pages = {164-168},
publisher = {IEEE},
address = {Naples, Italy},
series = {PDP'23},
abstract = {Several solutions aim to simplify the burdening task of parallel programming. The GrPPI library is one of them. It allows users to implement parallel code for multiple backends through a unified, abstract, and generic layer while promising minimal overhead on performance. An outspread evaluation of GrPPI regarding stream parallelism with representative metrics for this domain, such as throughput and latency, was not yet done. In this work, we evaluate GrPPI focused on stream processing. We evaluate performance, memory usage, and programming effort and compare them against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in some cases, in other cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
 | Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo SPBench: a framework for creating benchmarks of stream processing applications Journal Article doi In: Computing, vol. 105, no. 5, pp. 1077-1099, 2023. @article{GARCIA:Computing:23,
title = {SPBench: a framework for creating benchmarks of stream processing applications},
author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1007/s00607-021-01025-6},
doi = {10.1007/s00607-021-01025-6},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {Computing},
volume = {105},
number = {5},
pages = {1077-1099},
publisher = {Springer},
abstract = {In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and text, demand quick and efficient computation. Stream parallelism allows accelerating this computation for real-time processing. But it is still a challenging task, mostly reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. SPBench demonstrated to be a high-level, reusable, extensible, and easy-to-use abstraction to build parallel stream processing benchmarks on multi-core architectures.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
 | Araujo, Gabriell; Griebler, Dalvan; Rockenbach, Dinei A.; Danelutto, Marco; Fernandes, Luiz Gustavo NAS Parallel Benchmarks with CUDA and Beyond Journal Article doi In: Software: Practice and Experience, vol. 53, no. 1, pp. 53-80, 2023. @article{ARAUJO:SPE:23,
title = {NAS Parallel Benchmarks with CUDA and Beyond},
author = {Gabriell Araujo and Dalvan Griebler and Dinei A. Rockenbach and Marco Danelutto and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1002/spe.3056},
doi = {10.1002/spe.3056},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {Software: Practice and Experience},
volume = {53},
number = {1},
pages = {53-80},
publisher = {Wiley},
abstract = {NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide easy-to-use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number-of-threads-per-block configuration. The results have revealed relevant performance improvements solely by changing the number of threads per block, ranging from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmark versions available. We also observe the best and worst design choices concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
 | Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Micro-batch and data frequency for stream processing on multi-cores Journal Article doi In: The Journal of Supercomputing, vol. 79, no. 8, pp. 9206-9244, 2023. @article{GARCIA:JS:23,
title = {Micro-batch and data frequency for stream processing on multi-cores},
author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1007/s11227-022-05024-y},
doi = {10.1007/s11227-022-05024-y},
year = {2023},
date = {2023-01-01},
journal = {The Journal of Supercomputing},
volume = {79},
number = {8},
pages = {9206-9244},
publisher = {Springer},
abstract = {Latency and throughput are often critical performance metrics in stream processing. Applications' performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generate the most commonly used frequency patterns for benchmarking stream processing in related work. This allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow, two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2022
|
 | Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo Combining stream with data parallelism abstractions for multi-cores Journal Article doi In: Journal of Computer Languages, vol. 73, pp. 101160, 2022. @article{LOFF:COLA:22,
title = {Combining stream with data parallelism abstractions for multi-cores},
author = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1016/j.cola.2022.101160},
doi = {10.1016/j.cola.2022.101160},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
journal = {Journal of Computer Languages},
volume = {73},
pages = {101160},
publisher = {Elsevier},
abstract = {Stream processing applications have seen increasing demand with the growing availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that must be processed in a timely manner. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. In this work, we introduce improvements to stream processing applications by exploiting fine-grained data parallelism (via Map and MapReduce) inside coarse-grained stream parallelism stages. The improvements include techniques for identifying data parallelism in sequential code, a new language, semantic analysis, and a set of definition and transformation rules to perform source-to-source parallel code generation. Moreover, we investigate the feasibility of employing higher-level programming abstractions to support the proposed optimizations. For that, we elect the SPar programming model as a use case and extend it by adding two new attributes to its language and implementing our optimizations as a new algorithm in the SPar compiler. We conduct a set of experiments on representative stream processing and data-parallel applications. The results showed that our new compiler algorithm is efficient and that performance improved by up to 108.4x in data-parallel applications. Furthermore, experiments evaluating stream processing applications towards the composition of stream and data parallelism revealed new insights. The results showed that such composition may improve latencies by up to an order of magnitude. Also, it enables programmers to exploit different degrees of stream and data parallelism to accomplish a balance between throughput and latency according to their needs.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
 | Ernstsson, August; Griebler, Dalvan; Kessler, Christoph Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems Journal Article doi In: International Journal of Parallel Programming, vol. 51, no. 5, pp. 61-82, 2022. @article{Ernstsson:IJPP:22,
title = {Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems},
author = {August Ernstsson and Dalvan Griebler and Christoph Kessler},
url = {https://doi.org/10.1007/s10766-022-00746-1},
doi = {10.1007/s10766-022-00746-1},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
journal = {International Journal of Parallel Programming},
volume = {51},
number = {5},
pages = {61-82},
publisher = {Springer},
abstract = {We analyze the performance portability of the skeleton-based, single-source, multi-backend high-level programming framework SkePU across multiple different CPU–GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in lower-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of a part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80%, and in particular for CPU systems, SkePU can outperform hand-written code.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
| Rockenbach, Dinei A.; Löff, Júnior; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz G. High-Level Stream and Data Parallelism in C++ for GPUs Inproceedings doi In: XXVI Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Uberlândia, Brazil, 2022. @inproceedings{ROCKENBACH:SBLP:22,
title = {High-Level Stream and Data Parallelism in C++ for GPUs},
author = {Dinei A. Rockenbach and Júnior Löff and Gabriell Araujo and Dalvan Griebler and Luiz G. Fernandes},
url = {https://doi.org/10.1145/3561320.3561327},
doi = {10.1145/3561320.3561327},
year = {2022},
date = {2022-10-01},
booktitle = {XXVI Brazilian Symposium on Programming Languages (SBLP)},
pages = {41-49},
publisher = {ACM},
address = {Uberlândia, Brazil},
series = {SBLP'22},
abstract = {GPUs are massively parallel processors that allow solving problems that are not viable on traditional processors such as CPUs. However, implementing applications for GPUs is challenging for programmers, as it requires parallel programming to efficiently exploit the GPU resources. In this sense, parallel programming abstractions, notably domain-specific languages, are fundamental for improving programmability. SPar is a high-level Domain-Specific Language (DSL) that allows expressing stream and data parallelism in serial code through annotations using C++ attributes. This work elaborates on a methodology and tool for GPU code generation by introducing new attributes to the SPar language and transformation rules to the SPar compiler. These new contributions, besides the gains in simplicity and code reduction compared to CUDA and OpenCL, enabled SPar to achieve higher throughput when exploiting combined CPU and GPU parallelism and when using batching.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo Opinião de Brasileiros Sobre a Produtividade no Desenvolvimento de Aplicações Paralelas Inproceedings doi In: Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 276-287, SBC, Florianópolis, Brasil, 2022. @inproceedings{ANDRADE:WSCAD:22,
title = {Opinião de Brasileiros Sobre a Produtividade no Desenvolvimento de Aplicações Paralelas},
author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
url = {https://doi.org/10.5753/wscad.2022.226392},
doi = {10.5753/wscad.2022.226392},
year = {2022},
date = {2022-10-01},
booktitle = {Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
pages = {276-287},
publisher = {SBC},
address = {Florianópolis, Brazil},
abstract = {With the popularization of parallel architectures, several programming interfaces have emerged to ease the exploitation of such architectures and to increase developer productivity. However, developing parallel applications is still a complex task for developers with little experience. In this work, we conducted a survey to learn the opinion of parallel application developers about the factors that hinder productivity. Our results showed that the developers' experience is one of the main reasons for low productivity. Furthermore, the results indicated ways to mitigate this problem, such as improving and encouraging the teaching of parallel programming in undergraduate courses.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Kessler, Christoph; Ernstsson, August; Fernandes, Luiz Gustavo Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing Inproceedings doi In: 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022), pp. 229-232, IEEE, Gran Canaria, Spain, 2022. @inproceedings{ANDRADE:SEAA:22,
title = {Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing},
author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Christoph Kessler and August Ernstsson and Luiz Gustavo Fernandes},
url = {https://doi.org/10.1109/SEAA56994.2022.00043},
doi = {10.1109/SEAA56994.2022.00043},
year = {2022},
date = {2022-09-01},
booktitle = {48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022)},
pages = {229-232},
publisher = {IEEE},
address = {Gran Canaria, Spain},
series = {SEAA'22},
abstract = {Over the years, several Parallel Programming Models (PPMs) have supported the abstraction of programming complexity for parallel computer systems. However, few studies aim to evaluate the productivity reached by such abstractions since this is a complex task that involves human beings. There are several studies to develop predictive methods to estimate the effort required to program applications in software engineering. In order to evaluate the reliability of such metrics, it is necessary to assess the accuracy in different programming domains. In this work, we used the data of an experiment conducted with beginners in parallel programming to determine the effort required for implementing stream parallelism using FastFlow, SPar, and TBB. Our results show that some traditional software effort estimation models, such as COCOMO II, fall short, while Putnam's model could be an alternative for high-level PPMs evaluation. To overcome the limitations of existing models, we plan to create a parallelism-aware model to evaluate applications in this domain in future work.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Müller, Caetano; Löff, Junior; Griebler, Dalvan; Eizirik, Eduardo Avaliação da aplicação de paralelismo em classificadores taxonômicos usando Qiime2 Inproceedings doi In: Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 25-28, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. @inproceedings{MULLER:ERAD:22,
title = {Avaliação da aplicação de paralelismo em classificadores taxonômicos usando Qiime2},
author = {Caetano Müller and Junior Löff and Dalvan Griebler and Eduardo Eizirik},
url = {https://doi.org/10.5753/eradrs.2022.19152},
doi = {10.5753/eradrs.2022.19152},
year = {2022},
date = {2022-04-01},
booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul},
pages = {25-28},
publisher = {Sociedade Brasileira de Computação},
address = {Curitiba, Brazil},
abstract = {The classification of DNA sequences using machine learning algorithms still has room to evolve, both in the quality of the results and in the computational efficiency of the algorithms. In this work, we carried out a performance evaluation of two machine learning algorithms from the Qiime2 tool for DNA sequence classification. The results show that performance improved by up to 9.65 times when using 9 threads.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo Proposta de Framework para Processamento de Stream Distribuído em C++ utilizando o MPI Inproceedings doi In: Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 91-92, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. @inproceedings{LOFF:ERAD:22,
title = {Proposta de Framework para Processamento de Stream Distribuído em C++ utilizando o MPI},
author = {Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.5753/eradrs.2022.19177},
doi = {10.5753/eradrs.2022.19177},
year = {2022},
date = {2022-04-01},
booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul},
pages = {91-92},
publisher = {Sociedade Brasileira de Computação},
address = {Curitiba, Brazil},
abstract = {This work presents a proposal for a distributed stream processing framework in C++ using MPI. The initial stage of the study addresses the research problem and the design of the framework's architecture.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo Towards Efficient Stream Parallelism for Embedded Devices Inproceedings doi In: Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 62-64, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. @inproceedings{HOFFMANN:ERAD:22,
title = {Towards Efficient Stream Parallelism for Embedded Devices},
author = {Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes},
url = {https://doi.org/10.5753/eradrs.2022.19163},
doi = {10.5753/eradrs.2022.19163},
year = {2022},
date = {2022-04-01},
booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul},
pages = {62-64},
publisher = {Sociedade Brasileira de Computação},
address = {Curitiba, Brazil},
abstract = {Stream processing applications process raw data-flows to reveal insightful information. Efficiently coordinating the requirements of these applications is a challenge. We propose investigating high-level software solutions for these applications to achieve efficiency and high performance for embedded devices.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
| Faé, Leonardo; Griebler, Dalvan; Manssour, Isabel Aplicação de Vídeo com Flink, Storm e SPar em Multicores Inproceedings doi In: Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 13-16, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. @inproceedings{FAE:ERAD:22,
title = {Aplicação de Vídeo com Flink, Storm e SPar em Multicores},
author = {Leonardo Faé and Dalvan Griebler and Isabel Manssour},
url = {https://doi.org/10.5753/eradrs.2022.19149},
doi = {10.5753/eradrs.2022.19149},
year = {2022},
date = {2022-04-01},
booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul},
pages = {13-16},
publisher = {Sociedade Brasileira de Computação},
address = {Curitiba, Brazil},
abstract = {This work presents performance comparisons among the SPar, Apache Flink, and Apache Storm programming interfaces for the execution of a video processing application. The results reveal that the SPar versions deliver superior performance, while Apache Storm showed the worst performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}