The rapid advancement of large language models (LLMs) has pushed their parameter sizes beyond the memory capacity of a single GPU. Although distributed inference across multiple GPUs is a viable solution in enterprise settings, it remains inaccessible to most non-commercial users. There is therefore a growing demand to run LLMs on a single GPU even when the model does not fit entirely in GPU memory. A common approach is to offload parts of the model from the GPU to the CPU during inference. However, repeatedly transferring parameters between these devices incurs significant overhead. To address this challenge, we propose a new buffer management policy, LIRS-M, which maximizes buffer hits and minimizes data transfer. Experimental results show that our approach achieves a 2.0× speedup over state-of-the-art offloading techniques while delivering robust buffer-hit performance.
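As an illustration of the buffer management problem, the following minimal sketch simulates a GPU-side buffer for offloaded model layers using a plain LRU policy as a stand-in; the abstract does not describe LIRS-M itself, so the class, sizes, and access pattern here are illustrative assumptions only. Note how cyclic layer access during decoding makes a naive LRU buffer thrash, which is exactly the behavior a smarter policy would need to avoid.

```python
from collections import OrderedDict

class ParamBuffer:
    """Simulated GPU-side buffer for offloaded model layers (LRU stand-in, not LIRS-M)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.cache = OrderedDict()   # layer_id -> size in bytes
        self.hits = 0
        self.misses = 0

    def access(self, layer_id, size_bytes):
        if layer_id in self.cache:
            self.cache.move_to_end(layer_id)    # mark as most recently used
            self.hits += 1
            return
        self.misses += 1                        # layer must be copied CPU -> GPU
        while self.used + size_bytes > self.capacity and self.cache:
            _, evicted = self.cache.popitem(last=False)
            self.used -= evicted
        self.cache[layer_id] = size_bytes
        self.used += size_bytes

# Each decoding step touches every transformer layer in order (cyclic access).
buf = ParamBuffer(capacity_bytes=8 * 2**30)           # assume 8 GiB of GPU buffer
for step in range(4):
    for layer in range(32):
        buf.access(layer, size_bytes=400 * 2**20)     # assume ~400 MiB per layer
print(f"hit rate: {buf.hits / (buf.hits + buf.misses):.2f}")
```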
Selecting an unsuitable deep-learning (DL) model for transfer learning often results in negative transfer and poor performance. To address this, we propose a novel ranking approach designed to guide DL model selection by leveraging dataset characteristics of multivariate time-series data. Our approach considers three types of dataset characteristics: statistical, shape-based, and a combination of both. By implicitly learning dataset similarities, the ranking model identifies the most suitable DL model for positive transfer learning. Experiments show that the ranking model effectively learns to rank the most suitable DL model at position one. The impact of the characteristics on the ranking model's performance depends on the size of the training rankings. This approach highlights the importance of learning to select DL models for transfer learning on multivariate time-series datasets and offers a practical solution for effective transfer in time-series applications.
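A minimal sketch of this idea, under assumptions not stated in the abstract: datasets are described by a few simple statistical meta-features, a pointwise regressor scores each (dataset, model) pair, and candidate DL models are ranked by score for a new dataset. The feature choices, the synthetic labels, and the use of scikit-learn's GradientBoostingRegressor are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
MODELS = ["LSTM", "TCN", "Transformer"]

def meta_features(series):
    """Simple statistical characteristics of a multivariate time series."""
    return np.array([series.mean(), series.std(),
                     np.abs(np.diff(series, axis=0)).mean()])

# Synthetic meta-training data: (dataset features + model id) -> observed transfer score.
X, y = [], []
for _ in range(60):
    ds = rng.normal(size=(200, 5))            # one source dataset
    feats = meta_features(ds)
    for m, _model in enumerate(MODELS):
        X.append(np.concatenate([feats, [m]]))
        y.append(rng.random())                # placeholder transfer-performance label

ranker = GradientBoostingRegressor().fit(np.array(X), np.array(y))

# Rank candidate models for a new target dataset.
new_feats = meta_features(rng.normal(size=(200, 5)))
scores = {model: ranker.predict([np.concatenate([new_feats, [m]])])[0]
          for m, model in enumerate(MODELS)}
print(sorted(scores, key=scores.get, reverse=True))
```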
Data is a central resource for modern enterprises and institutions, and data validation is essential for ensuring the reliability of downstream applications. However, a major limitation of existing automated data unit testing frameworks is that they ignore the specific requirements of the tasks that consume the data. This paper introduces a task-aware approach to data validation that leverages large language models to generate customized data unit tests based on the semantics of downstream code. We present tadv, a prototype system that analyzes task code and dataset profiles to identify data access patterns, infer implicit data assumptions, and produce executable code for data unit tests. We evaluate our prototype with a novel benchmark comprising over 100 downstream tasks across two datasets, including annotations of their column access patterns and support for assessing the impact of synthetically injected data errors. We demonstrate that tadv outperforms task-agnostic baselines in detecting the data columns accessed by downstream tasks and generating data unit tests that account for the end-to-end impact of data errors. We make our benchmark and prototype code publicly available.
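To make the notion of a task-aware data unit test concrete, the following sketch shows the kind of test such a system might emit, written here by hand for a hypothetical downstream task that aggregates revenue by region; the column names, checks, and use of pandas are assumptions for illustration and are not output of tadv.

```python
import pandas as pd

def task_revenue_per_region(df):
    """Hypothetical downstream task: aggregate revenue by region."""
    return df.groupby("region")["revenue"].sum()

def data_unit_tests(df):
    """Checks derived from how the task accesses the data."""
    # The task groups by 'region' and sums 'revenue', so both columns must exist ...
    assert {"region", "revenue"}.issubset(df.columns), "required columns missing"
    # ... 'revenue' must be numeric and non-negative for the sum to be meaningful ...
    assert pd.api.types.is_numeric_dtype(df["revenue"]), "'revenue' must be numeric"
    assert (df["revenue"].dropna() >= 0).all(), "negative revenue values"
    # ... and 'region' should not be null, or rows silently drop out of the groupby.
    assert df["region"].notna().all(), "null region values"

df = pd.DataFrame({"region": ["EU", "US", "EU"], "revenue": [10.0, 5.0, 7.5]})
data_unit_tests(df)                      # passes: data satisfies the task's assumptions
print(task_revenue_per_region(df))
```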
Deep learning models are increasingly deployed in performance-critical applications, where computational efficiency must be balanced with accuracy. Model compression techniques such as pruning and quantization help address this challenge. However, they are often evaluated solely on accuracy, overlooking their impact on model size and inference time for a given hardware setup.
To support practitioners in selecting and evaluating different model compression techniques, we introduce our benchmarking framework, PQ Bench. It pre-implements a set of popular compression techniques and automates the process of benchmarking their effects on accuracy, inference speed, and memory footprint. In our evaluation, we demonstrate the use of PQ Bench and provide key insights into the trade-offs across various models, compression strategies, and configurations.
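As a rough illustration of the kind of measurement such a benchmark automates, the sketch below applies PyTorch's dynamic quantization to a small model and compares inference time and serialized size against the FP32 baseline; the model, batch size, and iteration counts are illustrative, and this is not PQ Bench's API.

```python
import io
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Post-training dynamic quantization of the linear layers to int8.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def benchmark(m, iters=200):
    """Average CPU inference latency per batch, in seconds."""
    x = torch.randn(64, 512)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
    return (time.perf_counter() - start) / iters

def size_mb(m):
    """Serialized model size as a proxy for memory footprint."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 2**20

for name, m in [("fp32", model), ("int8-dynamic", quantized)]:
    print(f"{name}: {benchmark(m) * 1e3:.2f} ms/batch, {size_mb(m):.2f} MB serialized")
```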
Thanks to the in-context learning abilities of LLMs, building a text classifier without access to labeled data is easier than ever. However, for tasks more complex than simple classification, difficulties remain. One such task is hierarchical segmentation, where a model needs to break an input document down into a hierarchy of segments, each annotated with a label from a taxonomy of classes. The long input documents, the large input class taxonomies, and the increased number of outputs make the problem challenging to solve with a single LLM call. While hierarchical segmentation is amenable to being split into smaller, more manageable tasks, the design space of such approaches is huge, making the process tedious and time-consuming. To reduce the amount of ad hoc exploration and implementation, we propose the first framework for hierarchical text segmentation using LLMs. The key idea behind our framework is that hierarchical segmentation can be viewed as a join between a document and a taxonomy. Inspired by join operator design, we propose two highly configurable hierarchical segmentation algorithms based on index and merge-sort joins. Our experiments highlight the existence of trade-offs across algorithms and their configurations, indicating that machine learning engineers may benefit from quickly exploring the design space using our framework.
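A minimal sketch of the index-join view of segmentation, under illustrative assumptions: taxonomy labels are embedded once into an "index" side, each document segment probes that index for candidate labels, and an LLM (stubbed here) makes the final pick. The toy hashing-based embedding, the taxonomy, and the stub are stand-ins, not the framework's actual operators.

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    """Toy deterministic embedding (stand-in for a real text encoder)."""
    h = hashlib.sha256(text.encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "little"))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Build the index side of the join: one vector per taxonomy label.
taxonomy = {"finance/earnings": "quarterly revenue and profit figures",
            "finance/markets": "stock prices and trading activity",
            "legal/contracts": "agreements, clauses and obligations"}
index = {label: embed(desc) for label, desc in taxonomy.items()}

def top_k_labels(segment, k=2):
    """Probe the index and return the k most similar candidate labels."""
    q = embed(segment)
    scores = {label: float(q @ v) for label, v in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def llm_pick(segment, candidates):
    """Stub for an LLM call that picks the best label among the candidates."""
    return candidates[0]

# Probe side of the join: each document segment looks up its candidate labels.
document = ["Revenue rose 12% in Q3 while net profit stayed flat.",
            "The parties agree to a 30-day termination clause."]
for segment in document:
    print(segment[:40], "->", llm_pick(segment, top_k_labels(segment)))
```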
Machine Learning (ML) in industrial chemistry is often hindered by the complexity of preprocessing heterogeneous datasets. In this proof-of-concept study, we explore the use of semantic data management to support LLM-driven automation of end-to-end ML pipelines in a real-world Chemistry 4.0 setting. A semantic model is used to capture domain knowledge and metadata in a machine-readable form, guiding LLMs through natural language prompts to generate complete data wrangling and ML modeling code. We evaluate several state-of-the-art LLMs on their ability to autonomously produce functionally correct Python code for preprocessing and Gaussian Process modeling. Our results show that, when guided by structured semantic context, larger LLMs can reliably generate accurate pipelines, significantly reducing the need for manual intervention. These findings provide an encouraging starting point for further exploration toward leveraging the semantic model to improve the robustness of code generation by systematically integrating relevant information into the generation process, rather than relying solely on the raw intelligence of the LLM.
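The sketch below shows the kind of pipeline such LLMs are asked to produce end-to-end: light data wrangling followed by Gaussian Process regression with scikit-learn. The column names, synthetic data, and kernel choice are illustrative assumptions, not the study's actual prompts, semantic model, or datasets.

```python
import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.preprocessing import StandardScaler

# Hypothetical sensor readings from a chemical process.
rng = np.random.default_rng(1)
temperature = np.linspace(40, 90, 30)
df = pd.DataFrame({
    "temperature_C": temperature,
    "pressure_bar": np.linspace(1.0, 2.5, 30),
    "yield_pct": 50 + 0.4 * temperature + rng.normal(0, 1, 30),
})

# Data wrangling: drop missing rows and standardize the inputs.
df = df.dropna()
X = StandardScaler().fit_transform(df[["temperature_C", "pressure_bar"]])
y = df["yield_pct"].to_numpy()

# Gaussian Process model with an RBF kernel plus a noise term.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(X[:3], return_std=True)
print(np.round(mean, 2), np.round(std, 2))
```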
Forward- and reverse-mode automatic differentiation evaluate the gradient of a model function efficiently by caching the results of partial derivatives. Just-in-time compilation improves the runtime of automatic differentiation by eliminating function calls and storing partial derivatives in virtual registers. This paper discusses the first open-source implementation of automatic differentiation with MLIR and LingoDB. The evaluation compares optimizations applied to forward and reverse mode. It shows that sub-expressions that appear frequently within the calculation are reused after MLIR performs its optimizations. Additionally, reverse mode outperforms forward mode because it generates less code.
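For readers unfamiliar with the technique, here is a minimal reverse-mode automatic differentiation sketch in plain Python: each node records its parents together with the local partial derivatives, and a single backward pass accumulates gradients. This illustrates the general mechanism only, not the MLIR/LingoDB implementation described above.

```python
class Var:
    """Scalar node in a reverse-mode autodiff graph."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate the gradient contribution of this path, then propagate.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# f(x, y) = (x + y) * x   =>   df/dx = 2x + y,   df/dy = x
x, y = Var(3.0), Var(4.0)
f = (x + y) * x
f.backward()
print(f.value, x.grad, y.grad)   # 21.0 10.0 3.0
```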
Corporations continually generate and receive new tabular data whose integration into existing systems is crucial to maximizing business value. However, this integration is challenging due to data heterogeneity and a lack of standard formats. Current methods focus on identifying relationships between table pairs, often neglecting table-to-database scenarios. Conventional schema matching techniques map columns between relational schemas but overlook subtables that could better align with database elements, limiting their effectiveness. We introduce the task of table dissolution: the identification of queries that integrate new tabular data into a database via coherent and semantically meaningful views over the new data. This process enables fine-grained integration in which new data dissolves into a database as salt dissolves into water.
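A minimal sketch of the kind of output table dissolution might produce: a wide incoming table is split into views that each align with an existing database table, expressed as SQL and materialized with sqlite3. The table, column, and view names are hypothetical and chosen only to illustrate the idea.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Incoming wide table mixing customer and order attributes.
con.execute("""CREATE TABLE incoming (
    customer_name TEXT, customer_city TEXT,
    order_id INTEGER, order_total REAL)""")
con.execute("INSERT INTO incoming VALUES ('Ada', 'Berlin', 1, 99.5)")

# "Dissolve" the incoming table into views that mirror existing database tables.
con.execute("""CREATE VIEW incoming_customers AS
               SELECT DISTINCT customer_name, customer_city FROM incoming""")
con.execute("""CREATE VIEW incoming_orders AS
               SELECT order_id, customer_name, order_total FROM incoming""")

print(con.execute("SELECT * FROM incoming_customers").fetchall())
print(con.execute("SELECT * FROM incoming_orders").fetchall())
```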
Video Retrieval-Augmented Generation (RAG) workflows augment user query contexts with stored video contexts to guide a Large Language Model (LLM) toward more context-relevant answers. The challenge is to design a workflow in which the video contexts can be generated with practical latency, while also ensuring that the resources available to run the workflows are not under-utilized whenever possible. In this paper, we propose an enhanced video RAG workflow (VRAG) that utilizes the Visual Data Management System (VDMS) interfaced with the Kubernetes (K8s) orchestration framework, enabling faster video context generation in the pre-processing phase and faster response generation with no impact on response accuracy. Our experiments show that, compared to a conventional RAG (CRAG) workflow, VRAG reduces pre-processing latency by at least 10%, with manifold further reductions under parallelization, and reduces response latency by 70% to as much as 97% with further parallelization.
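As a rough sketch of why parallelizing the pre-processing phase pays off, the example below splits a video into chunks and generates per-chunk context (a stubbed, sleep-based placeholder for captioning and embedding) across worker processes. The chunking granularity and the stub are illustrative and do not reflect the VDMS or Kubernetes setup used in the paper.

```python
from concurrent.futures import ProcessPoolExecutor
import time

def generate_chunk_context(chunk_id):
    """Stub for per-chunk video context generation (e.g., captioning + embedding)."""
    time.sleep(0.2)                      # simulate expensive vision/LLM work
    return f"context for chunk {chunk_id}"

def preprocess(num_chunks, workers):
    """Generate contexts for all chunks with a given degree of parallelism."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        contexts = list(pool.map(generate_chunk_context, range(num_chunks)))
    return contexts, time.perf_counter() - start

if __name__ == "__main__":
    for workers in (1, 4):
        _, elapsed = preprocess(num_chunks=16, workers=workers)
        print(f"{workers} worker(s): {elapsed:.2f}s for 16 chunks")
```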
We consider machine learning models, learned from data, to be an important, intensional kind of data in themselves. As such, various analysis tasks on models can be thought of as queries over this intensional data, often combined with extensional data such as training or validation data. We argue that relational database systems and SQL can be well suited for many such tasks.
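A minimal sketch of this view: model predictions are loaded into a relational table alongside validation labels, and a SQL query over the join computes per-model accuracy. The use of sqlite3 and the schema are illustrative choices, not the position paper's proposed system.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE predictions (example_id INTEGER, model TEXT, predicted TEXT)")
con.execute("CREATE TABLE validation (example_id INTEGER, label TEXT)")

con.executemany("INSERT INTO predictions VALUES (?, ?, ?)",
                [(1, "m1", "cat"), (2, "m1", "dog"), (3, "m1", "cat"),
                 (1, "m2", "cat"), (2, "m2", "cat"), (3, "m2", "cat")])
con.executemany("INSERT INTO validation VALUES (?, ?)",
                [(1, "cat"), (2, "dog"), (3, "dog")])

# Query the intensional data (model predictions) joined with extensional labels.
query = """
SELECT p.model,
       AVG(CASE WHEN p.predicted = v.label THEN 1.0 ELSE 0.0 END) AS accuracy
FROM predictions p JOIN validation v USING (example_id)
GROUP BY p.model
"""
for row in con.execute(query):
    print(row)
```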
In this paper, we propose Dataset2Graph, a novel methodology that targets the problem of automated machine learning (AutoML) for clustering. Dataset2Graph offers an end-to-end solution to meta-learning by encoding any input dataset as a similarity graph, which undergoes graph reduction (sparsification and coarsening) to obtain a smaller and more concise graph representation of the dataset. Besides making the processing more efficient, the graph reduction aims to preserve the most important structural properties of the dataset. Subsequently, a graph neural network (GNN) architecture is applied to the reduced graph, producing a graph embedding that is passed to a meta-learner, which predicts the best clustering algorithm for the dataset at hand. Our empirical evaluation on a set of 50 collected datasets shows the effectiveness of Dataset2Graph for different GNN models under various settings and in comparison with the state-of-the-art.
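A minimal sketch of the first stage of such a pipeline: a tabular dataset is encoded as a k-nearest-neighbor similarity graph over its rows and then sparsified by dropping edges longer than the median distance. The choice of k, the distance-based similarity, and the use of NetworkX are illustrative assumptions, and the GNN and meta-learner stages are omitted.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # one input dataset: 100 rows, 8 features

# Encode the dataset as a k-NN similarity graph over its rows.
adj = kneighbors_graph(X, n_neighbors=10, mode="distance")
G = nx.from_scipy_sparse_array(adj)

# Sparsify: keep only the edges shorter than the median pairwise distance.
threshold = np.median([d["weight"] for _, _, d in G.edges(data=True)])
G.remove_edges_from([(u, v) for u, v, d in G.edges(data=True) if d["weight"] > threshold])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges after sparsification")
# A GNN would then embed this reduced graph, and a meta-learner would map the
# embedding to the predicted best clustering algorithm (e.g., k-means vs. DBSCAN).
```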