publications
- Faster-BNI: Fast Parallel Exact Inference on Bayesian Networks. Jiantong Jiang, Zeyi Wen, Atif Mansoor, and Ajmal Mian. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2024.
- Fast inference for probabilistic graphical models. Jiantong Jiang, Zeyi Wen, Atif Mansoor, and Ajmal Mian. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), 2024.
Probabilistic graphical models (PGMs) have attracted much attention due to their firm theoretical foundation and inherent interpretability. However, existing PGM inference systems are inefficient and lack sufficient generality, due to issues with irregular memory accesses, high computational complexity, and modular design limitations. In this paper, we present Fast-PGM, a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms. Fast-PGM incorporates careful memory management techniques to reduce memory consumption and enhance data locality. It also employs computation and parallelization optimizations to reduce computational complexity and improve overall efficiency. Furthermore, Fast-PGM offers high generality and flexibility, allowing easy integration with all the mainstream importance sampling-based algorithms. The system abstraction of Fast-PGM facilitates easy optimization, extension, and customization for users. Extensive experiments show that Fast-PGM achieves 3 to 20 times speedup over the state-of-the-art implementation. Fast-PGM source code is freely available at https://github.com/jjiantong/FastPGM.
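For readers unfamiliar with this family of algorithms, the core loop of importance sampling-based inference can be illustrated with the likelihood-weighting sketch below: non-evidence variables are sampled from their conditional distributions, and each sample is weighted by the likelihood of the evidence. The toy two-node network, variable names, and sample count are illustrative assumptions only, not Fast-PGM's actual data structures or code.

```python
import random

# Toy Bayesian network Rain -> WetGrass (hypothetical example, unrelated to Fast-PGM).
p_rain = 0.2
p_wet_given_rain = {True: 0.9, False: 0.1}

def likelihood_weighting(evidence_wet, num_samples=100_000):
    """Estimate P(Rain = True | WetGrass = evidence_wet) by importance sampling."""
    weighted = {True: 0.0, False: 0.0}
    for _ in range(num_samples):
        # Sample the non-evidence variable from its conditional distribution.
        rain = random.random() < p_rain
        # Evidence variables are clamped; the sample weight is their likelihood.
        weight = p_wet_given_rain[rain] if evidence_wet else 1 - p_wet_given_rain[rain]
        weighted[rain] += weight
    return weighted[True] / (weighted[True] + weighted[False])

print(likelihood_weighting(evidence_wet=True))  # ≈ 0.9*0.2 / (0.9*0.2 + 0.1*0.8) ≈ 0.69
```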
- Efficient Hyperparameter Optimization with Adaptive Fidelity Identification. Jiantong Jiang, Zeyi Wen, Atif Mansoor, and Ajmal Mian. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Hyperparameter Optimization and Neural Architecture Search are powerful in attaining state-of-the-art machine learning models, with Bayesian Optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic in this field, but it faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle this challenge, we propose a multi-fidelity BO method named FastBO, which excels in adaptively deciding the fidelity for each configuration and providing strong performance while ensuring efficient resource usage. These advantages are achieved through our proposed techniques based on the concepts of the efficient point and the saturation point of each configuration, which can be obtained from the configuration's empirical learning curve estimated from early observations. Extensive experiments demonstrate FastBO's superior anytime performance and efficiency in identifying high-quality configurations and architectures. We also show that our method provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting the wide applicability of our approach.
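The central idea, deciding a fidelity from a learning curve estimated on early observations, can be sketched roughly as follows: fit a saturating curve to the first few validation scores of a configuration and take the budget at which the predicted per-step gain becomes negligible as its saturation point. The curve model, threshold, and data below are illustrative assumptions, not FastBO's actual estimator.

```python
import numpy as np
from scipy.optimize import curve_fit

def sat_curve(t, a, b, c):
    # Simple saturating model: performance approaches the plateau `a` as the budget t grows.
    return a - b * np.exp(-c * t)

# Hypothetical early observations: validation accuracy after the first few epochs.
epochs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([0.62, 0.71, 0.76, 0.79, 0.80])
params, _ = curve_fit(sat_curve, epochs, scores, p0=[0.85, 0.3, 0.5], maxfev=10_000)

def saturation_point(params, horizon=100, eps=1e-3):
    """Smallest budget at which the predicted per-epoch gain drops below eps."""
    for t in range(1, horizon):
        if sat_curve(t + 1, *params) - sat_curve(t, *params) < eps:
            return t
    return horizon

print("predicted plateau:", round(sat_curve(1e6, *params), 3))
print("saturation point (epochs):", saturation_point(params))
```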
- Parallel and Distributed Bayesian Network Structure Learning. Jian Yang, Jiantong Jiang, Zeyi Wen, and Ajmal Mian. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2024.
Bayesian networks (BNs) are graphical models representing uncertainty in causal discovery, and have been widely used in medical diagnosis and gene analysis due to their effectiveness and good interpretability. However, mainstream BN structure learning methods are computationally expensive, as they must perform numerous conditional independence (CI) tests to decide the existence of edges. Some researchers attempt to accelerate the learning process with parallelism, but face issues including load imbalance and costly dominant parallelism overhead. We propose a multi-threaded method, namely Fast-BNS version 1 (Fast-BNS-v1 for short), on multi-core CPUs to enhance the efficiency of BN structure learning. Fast-BNS-v1 incorporates a series of efficiency optimizations, including a dynamic work pool for better scheduling, grouping of CI tests to avoid unnecessary operations, cache-friendly data storage to improve memory efficiency, and on-the-fly conditioning set generation to avoid extra memory consumption. To further boost learning performance, we develop a two-level parallel method, Fast-BNS-v2, by integrating edge-level parallelism with multiple processes and CI-level parallelism with multiple threads. Fast-BNS-v2 is equipped with careful optimizations including dynamic work stealing for load balancing, SIMD edge list deletion for list updating, and effective communication policies for synchronization. Comprehensive experiments show that our Fast-BNS achieves 9 to 235 times speedup over the state-of-the-art multi-threaded method on a single machine. When running on multiple machines, it further reduces the execution time of the single-machine implementation by 80%.
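The dominant cost that both Fast-BNS versions target is the CI test itself. As a generic illustration (not Fast-BNS's optimized implementation), a chi-square test of X ⟂ Y given a conditioning set Z on discrete data can be written as follows; the synthetic data at the end is a hypothetical example where X and Y are independent given Z.

```python
import numpy as np
from collections import defaultdict
from scipy.stats import chi2

def ci_test(data, x, y, z, alpha=0.05):
    """Chi-square CI test between integer-coded columns x and y given columns z.
    Returns True if the independence hypothesis is NOT rejected."""
    stat, dof = 0.0, 0
    strata = defaultdict(list)
    for row in data:                              # stratify the samples by the value of Z
        strata[tuple(row[list(z)])].append(row)
    for rows in strata.values():
        rows = np.array(rows)
        xs, ys = np.unique(rows[:, x]), np.unique(rows[:, y])
        table = np.array([[np.sum((rows[:, x] == a) & (rows[:, y] == b)) for b in ys] for a in xs])
        expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
        mask = expected > 0
        stat += np.sum((table[mask] - expected[mask]) ** 2 / expected[mask])
        dof += (len(xs) - 1) * (len(ys) - 1)
    return chi2.sf(stat, max(dof, 1)) > alpha     # large p-value: keep the independence

# Hypothetical data: X and Y are both noisy copies of Z, hence independent given Z.
rng = np.random.default_rng(0)
zcol = rng.integers(0, 2, 5000)
data = np.column_stack([zcol ^ (rng.random(5000) < 0.1),
                        zcol ^ (rng.random(5000) < 0.1), zcol])
print(ci_test(data, x=0, y=1, z=[2]))             # True (independence kept) most of the time
```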
- Staleness-reduction Mini-batch K-means. Xueying Zhu, Jie Sun, Zhenhao He, Jiantong Jiang, and Zeke Wang. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023.
K-means is a clustering algorithm that has been widely adopted due to its simple implementation and high clustering quality. However, the standard k-means suffers from high computational complexity and is therefore time-consuming. Accordingly, mini-batch k-means was proposed to significantly reduce computational costs by updating centroids after performing distance computations on just a mini-batch, rather than a full batch, of samples. Even though mini-batch k-means converges faster, it leads to a decrease in convergence quality because it introduces staleness during iterations. To this end, in this paper, we propose staleness-reduction mini-batch k-means (srmbatch), which achieves the best of both worlds: low computational costs like mini-batch k-means (mbatch) and high clustering quality like the standard k-means. Moreover, srmbatch still exposes massive parallelism that can be efficiently exploited on multi-core CPUs and many-core GPUs. The experimental results show that srmbatch converges up to 40 to 130 times faster than mbatch when reaching the same target loss, and reaches a 0.2% to 1.7% lower final loss than mbatch.
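For context, the mbatch baseline follows the standard mini-batch k-means update sketched below (illustrative Python, not the paper's multi-core CPU or GPU implementation of srmbatch); each iteration computes distances for only one mini-batch and nudges the affected centroids with a per-centroid decaying step size.

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=256, iters=200, seed=0):
    """Plain mini-batch k-means (the mbatch baseline, without staleness reduction)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    counts = np.zeros(k)                                  # per-centroid update counts
    for _ in range(iters):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # Distance computations on the mini-batch only, then nearest-centroid assignment.
        dists = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # Move each assigned centroid toward the sample with a decaying step size.
        for j, x in zip(assign, batch):
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]
    return centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(500, 2)) for m in (0, 3, 6)])   # three toy clusters
print(mini_batch_kmeans(X, k=3).round(2))
```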
- Fast Parallel Exact Inference on Bayesian Networks. Jiantong Jiang, Zeyi Wen, Atif Mansoor, and Ajmal Mian. In ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2023.
Bayesian networks (BNs) are attractive because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve efficiency, we propose a fast BN exact inference solution named Fast-BNI on multi-core CPUs. Fast-BNI enhances the efficiency of exact inference through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. We also propose techniques to further simplify the bottleneck operations of BN exact inference. Fast-BNI source code is freely available at https://github.com/jjiantong/FastBN.
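For context, exact inference on a BN boils down to repeated products and marginalizations of probability tables (factors); these are the kinds of bottleneck operations that Fast-BNI parallelizes and simplifies. The tiny variable-elimination sketch below, on a hypothetical two-node network, illustrates the operations only; it is not Fast-BNI's implementation.

```python
from itertools import product

# A factor is (variables, table), where table maps a tuple of 0/1 values to a probability.
def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vars_ = list(dict.fromkeys(fv + gv))
    table = {}
    for assign in product([0, 1], repeat=len(vars_)):
        a = dict(zip(vars_, assign))
        table[assign] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vars_, table

def marginalize(f, var):
    fv, ft = f
    keep = [v for v in fv if v != var]
    out = {}
    for assign, p in ft.items():
        key = tuple(val for v, val in zip(fv, assign) if v != var)
        out[key] = out.get(key, 0.0) + p
    return keep, out

# Hypothetical chain A -> B with CPTs P(A) and P(B|A); eliminating A yields P(B).
p_a = (["A"], {(0,): 0.7, (1,): 0.3})
p_b_given_a = (["A", "B"], {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
print(marginalize(multiply(p_a, p_b_given_a), "A"))   # (['B'], {(0,): 0.69, (1,): 0.31})
```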
- Fast Parallel Bayesian Network Structure Learning. Jiantong Jiang, Zeyi Wen, and Ajmal Mian. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022.
Bayesian networks (BNs) are a widely used graphical model in machine learning for representing knowledge with uncertainty. The mainstream BN structure learning methods require performing a large number of conditional independence (CI) tests. The learning process is very time-consuming, especially for high-dimensional problems, which hinders the adoption of BNs in more applications. Existing works attempt to accelerate the learning process with parallelism, but face issues including load unbalancing, costly atomic operations, and dominant parallel overhead. In this paper, we propose a fast solution named Fast-BNS on multi-core CPUs to enhance the efficiency of BN structure learning. Fast-BNS is powered by a series of efficiency optimizations, including (i) designing a dynamic work pool to monitor the processing of edges and to better schedule the workloads among threads, (ii) grouping the CI tests of edges with the same endpoints to reduce the number of unnecessary CI tests, (iii) using cache-friendly data storage to improve memory efficiency, and (iv) generating the conditioning sets on-the-fly to avoid extra memory consumption. A comprehensive experimental study shows that the sequential version of Fast-BNS is up to 50 times faster than its counterpart, and the parallel version of Fast-BNS achieves 4.8 to 24.5 times speedup over the state-of-the-art multi-threaded solution. Moreover, Fast-BNS scales well with both network size and sample size.
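The dynamic work pool of item (i) can be pictured as a shared queue of edges from which worker threads pull the next pending edge as soon as they finish one, so edges whose CI tests happen to be expensive do not stall the others. The sketch below is a schematic in Python threads with placeholder workloads; the actual system is multi-core C++ and its scheduling details differ.

```python
import queue
import random
import threading
import time

# Hypothetical edges of a small network; each edge carries a group of CI tests of varying cost.
edges = [(f"X{i}", f"X{j}") for i in range(6) for j in range(i + 1, 6)]
work_pool = queue.Queue()
for e in edges:
    work_pool.put(e)

results, lock = {}, threading.Lock()

def worker():
    while True:
        try:
            edge = work_pool.get_nowait()               # dynamically grab the next pending edge
        except queue.Empty:
            return
        time.sleep(random.uniform(0.001, 0.01))         # stand-in for this edge's CI tests
        with lock:
            results[edge] = random.random() < 0.5       # placeholder independence decision

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "edges decided")
```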
- Parallel and distributed structured SVM training. Jiantong Jiang, Zeyi Wen, Zeke Wang, Bingsheng He, and Jian Chen. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2021.
Structured Support Vector Machines (structured SVMs) are a fundamental machine learning algorithm with a solid theoretical foundation and high effectiveness in applications such as natural language parsing and computer vision. However, training structured SVMs is very time-consuming, due to the large number of constraints and inferior convergence rates, especially for large training data sets. The high cost of training structured SVMs has hindered their adoption in new applications. In this article, we aim to improve the efficiency of structured SVMs by proposing a parallel and distributed solution (namely FastSSVM) for training structured SVMs, building on top of MPI and OpenMP. FastSSVM exploits a series of optimizations (e.g., optimizations on data storage and synchronization) to efficiently use the resources of the nodes in a cluster and the cores of each node. Moreover, FastSSVM tackles the large constraint set problem by batch processing and addresses the slow convergence challenge by adapting stop conditions based on the improvement of each iteration. We theoretically prove that our solution is guaranteed to converge to a global optimum. A comprehensive experimental study shows that FastSSVM achieves at least four times speedup over the existing solutions, and in some cases achieves two to three orders of magnitude speedup.
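The adaptive stopping idea, terminating once the per-iteration improvement stays below a tolerance derived from the observed progress, can be illustrated in isolation with the toy optimizer below. The tolerance rule, toy objective, and hyperparameters are assumptions for illustration; they are not FastSSVM's actual stop conditions or its structured SVM training procedure.

```python
import numpy as np

def train_with_adaptive_stop(step, w0, base_eps=1e-3, patience=3, max_iter=1000):
    """Run an iterative optimizer and stop once the improvement stays below a
    tolerance scaled by the current objective (illustrative stand-in only)."""
    w, prev_obj, small_steps = w0, np.inf, 0
    for it in range(max_iter):
        w, obj = step(w)
        eps = max(base_eps, 0.01 * abs(obj))            # tolerance adapts to the objective scale
        small_steps = small_steps + 1 if prev_obj - obj < eps else 0
        if small_steps >= patience:
            return w, it
        prev_obj = obj
    return w, max_iter

# Toy convex objective: ridge-regularized hinge loss on random data (not a structured SVM).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=200))

def subgradient_step(w, lr=0.05, lam=0.01):
    margins = y * (X @ w)
    grad = lam * w - (X * y[:, None])[margins < 1].sum(0) / len(X)
    w = w - lr * grad
    obj = lam / 2 * w @ w + np.maximum(0, 1 - y * (X @ w)).mean()
    return w, obj

w, iters = train_with_adaptive_stop(subgradient_step, np.zeros(5))
print("stopped after", iters, "iterations")
```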
- Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs. Jiantong Jiang, Zeke Wang, Xue Liu, Juan Gómez-Luna, Nan Guan, Qingxu Deng, Wei Zhang, and Onur Mutlu. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2020.
FPGA vendors provide OpenCL software development kits for easier programmability, with the goal of replacing the time-consuming and error-prone register-transfer level (RTL) programming. Many studies explore optimization methods (e.g., loop unrolling, local memory) to accelerate OpenCL programs running on FPGAs. These programs typically follow the default OpenCL execution model, where a kernel deploys multiple work-items arranged into work-groups. However, the default execution model is not always a good fit for an application mapped to the FPGA architecture, which is very different from the multithreaded architecture of GPUs, for which OpenCL was originally designed. In this work, we identify three other execution models that can better utilize FPGA resources for the OpenCL applications that do not fit well into the default execution model. These three execution models are based on two OpenCL features devised for FPGA programming (namely, the single work-item kernel and the OpenCL channel). We observe that the selection of the right execution model determines the performance upper bound of a particular application, which can vary by two orders of magnitude between the most suitable execution model and the most unsuitable one. However, there is no way to select the most suitable execution model other than empirically exploring the optimization space for all four of them, which can be prohibitive. To help FPGA programmers identify the right execution model, we propose Boyi, a systematic framework that makes automatic decisions by analyzing OpenCL programming patterns in an application. After finding the right execution model with the help of Boyi, programmers can apply other conventional optimizations to reach the performance upper bound. Our experimental evaluation shows that Boyi can 1) accurately determine the right execution model, and 2) greatly reduce the exploration space of conventional optimization methods.