Joint research project

HAMLET: Hardware Acceleration of Machine LEarning Tasks

Project leaders
Raffaele Perego, Veronica Graciela Gil Costa
Agreement
ARGENTINA - CONICET - Consejo Nacional de Investigaciones Científicas y Técnicas
Call
CNR/CONICET biennium 2017-2018
Department
Engineering, ICT and technologies for energy and transportation
Thematic area
Engineering, ICT and technologies for energy and transportation
Status of the project
New

Research proposal

Recent advances in Machine Learning (ML) have made it possible to model phenomena that were previously too complex for computers to handle. This radical change has opened new horizons for business and science, improving the quality, speed and accuracy of services in many areas of human activity. The complexity of machine-learnt models and their widespread use, however, require novel algorithmic solutions aimed at making both the learning phase and the use of the learnt models fast and scalable in large-scale applications.
As shown by the CVs enclosed with the proposal, the Argentinean and Italian research groups proposing this project have complementary expertise that can make a difference in this challenging field: ISTI-CNR has leading experience in algorithmic solutions for efficient machine learning at large scale, and in the last two years it has published several papers on this specific topic in top-tier conferences and journals. UNSL, on the other hand, has long-term experience in parallel computational models and platforms, including System-on-Chip (SoC) technologies based on Field-Programmable Gate Arrays (FPGAs).
We propose to exploit this strong background to investigate how hardware acceleration platforms based on FPGAs can offer viable and relevant opportunities to make complex ML models faster and more scalable.
The main focus of the project is the hardware acceleration of ML models based on forests of regression trees generated by boosting meta-algorithms, which iteratively learn and combine thousands of simple decision trees. The reason for this choice is twofold:
o Machine-learnt models based on forests of trees have proved to be particularly robust and effective in several classification, regression and ranking tasks, but they are very demanding from a computational point of view: every tree of the forest must be traversed for each item the model is applied to, in order to compute its contribution to the final result (the first sketch after this list makes the cost concrete). This high computational cost becomes a challenging issue in applications where the time budget available to apply the learnt model to a possibly huge number of items is limited and users' expectations in terms of quality of service are very high.
o The Italian proposers recently introduced QuickScorer (QS), awarded best paper at the ACM Conference on Research and Development in Information Retrieval (SIGIR) 2015, subject of a pending international patent, and currently deployed in production by the Italian search engine istella.it. QS remarkably improves the performance of any forest-based ML model by exploiting the features and characteristics of modern CPUs and memory hierarchies. QS is particularly well suited to an implementation on SoC-based FPGAs because it adopts a novel compact bit-vector representation of the tree-based model and traverses the ensemble by means of simple logical bitwise operations. The traversal is not performed one tree after another, as one would expect, but is instead interleaved, feature by feature, over the whole tree ensemble (see the second sketch below). Thanks to its cache-aware approach, both in terms of data layout and access patterns, and to a control flow with very low branch misprediction rates, the performance of QS is impressive, with speedups on traditional CPUs of up to 6.5x over state-of-the-art competitors. We expect to achieve a further order-of-magnitude speedup with the FPGA implementation.
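To make the cost argument concrete, the first sketch below shows the standard tree-by-tree traversal of a forest in minimal C++: every tree is walked root-to-leaf for every scored item, with data-dependent branches and pointer-chasing memory accesses. The node layout and all names are illustrative assumptions, not the structures of any specific library.

```cpp
#include <vector>

struct Node {
    int   feature;      // feature tested at this node (-1 marks a leaf)
    float threshold;    // test: x[feature] <= threshold
    int   left, right;  // child indices within the tree
    float value;        // leaf output (valid when feature == -1)
};

using Tree = std::vector<Node>;

float score_naive(const std::vector<Tree>& forest,
                  const std::vector<float>& x) {   // item's feature vector
    float score = 0.0f;
    for (const Tree& tree : forest) {              // every tree, every item
        int n = 0;                                  // start at the root
        while (tree[n].feature != -1)               // descend to a leaf
            n = (x[tree[n].feature] <= tree[n].threshold)
                    ? tree[n].left : tree[n].right;
        score += tree[n].value;                     // accumulate contribution
    }
    return score;
}
```

Each node visit here is an unpredictable branch and a scattered memory access, which is precisely the behavior QS avoids.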
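The second sketch outlines, in the same hedged spirit, the core idea of the QS traversal described above: each node carries a bit-vector of the leaves that become unreachable when its test is false; these masks are ANDed into per-tree leaf bit-vectors, feature by feature over the whole ensemble, and the leftmost surviving bit of each bit-vector identifies the exit leaf. It assumes trees with at most 64 leaves, so that one 64-bit word per tree suffices; the data layout and function names are our simplification, not the patented implementation.

```cpp
#include <bit>       // std::countl_zero (C++20)
#include <cstdint>
#include <vector>

// All nodes of the ensemble that test the same feature, merged and
// sorted by ascending threshold (illustrative layout).
struct FeatureBlock {
    std::vector<float>    thresholds;
    std::vector<uint32_t> tree_ids;  // tree owning each node
    std::vector<uint64_t> masks;     // leaves unreachable if the test is false
};

float score_quickscorer(const std::vector<FeatureBlock>& blocks,
                        const std::vector<std::vector<float>>& leaf_values,
                        const std::vector<float>& x) {  // item's feature vector
    const size_t n_trees = leaf_values.size();
    std::vector<uint64_t> leaf_bv(n_trees, ~0ULL);      // all leaves reachable

    // Interleaved traversal: feature by feature over the whole ensemble.
    for (size_t f = 0; f < blocks.size(); ++f) {
        const FeatureBlock& b = blocks[f];
        // Thresholds are sorted: stop at the first test that holds for x[f].
        for (size_t i = 0; i < b.thresholds.size() && x[f] > b.thresholds[i]; ++i)
            leaf_bv[b.tree_ids[i]] &= b.masks[i];       // simple bitwise AND
    }

    // With leaves numbered left to right from the most significant bit,
    // the exit leaf of each tree is its leftmost surviving bit.
    float score = 0.0f;
    for (size_t t = 0; t < n_trees; ++t)
        score += leaf_values[t][std::countl_zero(leaf_bv[t])];
    return score;
}
```

The inner loop streams sequentially through memory and contains no data-dependent branching beyond a sorted early exit, which is what makes QS cache-friendly on CPUs and, as argued above, a natural candidate for bitwise logic in an FPGA fabric.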
We believe that hardware acceleration is an attractive way to cope with the execution times of ML models, and that FPGA technology definitely deserves to be investigated to scale ML models beyond the capabilities of traditional CPUs. SoC-based FPGAs provide a system solution that can be rapidly deployed and is flexible enough to adapt to specific design requirements and changing demands. In general, they are well known for their configurability. This potential benefit, however, is often not fully realized, because creating efficient FPGA designs is generally a laborious, case-specific effort requiring considerable time and redundant work. The design flow for SoCs includes requirements and specification, software/hardware partitioning (Hw/Sw co-design), hardware and software development and testing, and system integration and testing. In the software/hardware partitioning stage, given code specified in C/C++, the designers decide which parts of the system to accelerate in the programmable logic (hardware) and which parts to keep as sequential code running on the processing system. Hw/Sw co-design is carried out by analyzing the different tasks the system must perform, taking into account their execution times and opportunities for parallelization (a hedged sketch of such a partitioning is given below).
The specific domain of the project mitigates the above risk thanks to the generality of the ML models addressed. Our first goal is in fact to design a fully-optimized FPGA implementation of the QS algorithm that takes as input any forest of regression trees learnt with a black-box ML algorithm and provides a very efficient function for traversing the forest with any item to be scored/classified. Since the optimized FPGA implementation of QS will be totally independent of the specific forest, it will be employable without modification in any classification/regression/ranking application based on ML forests of trees.
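As an illustration of the partitioning just described, the sketch below expresses the mask-application loop of QS in the style of a high-level synthesis (HLS) kernel: the processing system prepares the per-feature node data in software, while the programmable logic performs the bitwise updates. The pragmas, array sizes and interface are assumptions for illustration, not a validated design.

```cpp
#include <cstdint>

constexpr int MAX_NODES = 4096;   // assumed upper bound on nodes per feature
constexpr int MAX_TREES = 1024;   // assumed upper bound on ensemble size

// Hardware side of a possible Hw/Sw partitioning: the processing system
// (software) streams in one feature block at a time; the programmable
// logic applies the false-node masks with simple bitwise ANDs.
void qs_mask_kernel(const float    thresholds[MAX_NODES],  // sorted ascending
                    const uint16_t tree_ids[MAX_NODES],
                    const uint64_t masks[MAX_NODES],
                    int            n_nodes,
                    float          x_f,                    // item's feature value
                    uint64_t       leaf_bv[MAX_TREES]) {
    for (int i = 0; i < n_nodes; ++i) {
#pragma HLS PIPELINE II=1         // target: one mask update per clock cycle
        if (x_f <= thresholds[i]) break;   // remaining tests all hold: stop early
        leaf_bv[tree_ids[i]] &= masks[i];  // bitwise AND in the logic fabric
    }
}
```

Whether this loop, the final leaf-score reduction, or both end up in the fabric is exactly the kind of decision the co-design stage must settle, based on profiled execution times.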

The impact of the resulting solution on real-world applications could be huge. Consider, for example, a Web search scenario, where millions of candidate documents are scored to return relevant query results to any requesting user in less than half a second. Forest-based ML ranking models are the state of the art in this scenario but, due to their high computational cost, suboptimal rankers are often used to fit the available time budget. Devising techniques to speed up document ranking without sacrificing quality is therefore an urgent research topic in Web search.

Research goals

We expect to design, implement and evaluate the QS algorithm on SoC-based FPGA platforms. This task requires carefully identifying QS's data dependences and the functions that can be efficiently parallelized (a simple illustration follows the list below). Particular attention will be devoted to memory usage and memory access patterns, to determine their impact on the SoC implementation of ML models of varying size and complexity. The collaboration on this specific case study will allow the proposers to evaluate the impact of hardware acceleration on the efficiency of ML models for classification, regression and ranking tasks. The results achieved and the domain knowledge gained will be generalized and possibly applied to other computationally-intensive ML tasks that can benefit from hardware acceleration. Additional outcomes of the project are:
1) The enhancement of scientific collaboration between Argentina and Italy;
2) The transfer of technology and knowledge between both teams;
3) The exchange of (young) researchers and PhD students between the two research groups;
4) The publication of joint papers in top-tier international conferences and journals.
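As a simple illustration of the dependence analysis mentioned above: the final scoring stage of QS carries no dependence between trees, so it can be split across parallel accumulators (and, on an FPGA, across parallel hardware units). The constants and names below are assumptions for illustration.

```cpp
#include <bit>       // std::countl_zero (C++20)
#include <cstdint>

constexpr int N_TREES = 1000;   // assumed ensemble size
constexpr int N_UNITS = 8;      // assumed number of parallel accumulators

float reduce_scores(const uint64_t leaf_bv[N_TREES],
                    const float    leaf_values[N_TREES][64]) {
    float partial[N_UNITS] = {};                 // one accumulator per unit
    for (int t = 0; t < N_TREES; ++t) {          // iterations are independent
        int leaf = std::countl_zero(leaf_bv[t]); // exit leaf of tree t
        partial[t % N_UNITS] += leaf_values[t][leaf];
    }
    float score = 0.0f;                          // final reduction over units
    for (int u = 0; u < N_UNITS; ++u)            // (floating-point
        score += partial[u];                     //  reassociation assumed OK)
    return score;
}
```

The mask-application stage, by contrast, carries a read-modify-write dependence on each tree's bit-vector, which is why its memory layout and access pattern deserve the close scrutiny planned above.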
The project will be implemented by means of research visits for the training of researchers on topics mastered by the receiving institution. During the visits, courses and seminars will be offered to disseminate and transfer the groups' reciprocal expertise. The workplan will be organized in the three WPs detailed below.

Last update: 28/03/2024