\title{Multi-Task Shared-Specific Sparse Fine-Tuning for Large Language Models}

\begin{document}

\maketitle

\begin{abstract}
Large language models are increasingly required to support multiple downstream tasks under strict parameter budgets, yet many PEFT methods introduce auxiliary modules that incur additional overhead.
Sparse fine-tuning avoids this by directly applying sparse parameter updates to pretrained weights, without modifying model architectures or introducing inference latency.
However, existing sparse fine-tuning methods are mostly designed for single-task settings and lack systematic modeling of structure sharing and budget allocation in multi-task scenarios.
To tackle these challenges, we propose MESSA, a multi-task shared-specific sparse fine-tuning framework for large language models.
MESSA decomposes task adaptations into globally shared and task-specific sparse deltas, allowing flexible sharing across related tasks.
To enforce a unified parameter budget, MESSA adopts a budget-aware soft-to-hard structure learning strategy, where differentiable gates are first optimized to induce structured sparsity and then hardened via a single global pruning step.
Extensive experiments on multi-task benchmarks demonstrate that MESSA consistently outperforms existing PEFT baselines under comparable parameter budgets.
\end{abstract}

\begin{figure}[t]
\centering
\includegraphics[width=1\linewidth]{assets/attndiff.png}
\caption{Task-dependent activation differences between CodeAlpaca and MedQA across different attention modules and layers. Red indicates higher activation in CodeAlpaca, while blue indicates higher activation in MedQA. This highlights the task-specific and shared adaptation requirements at different layers and modules during multi-task fine-tuning.}
\label{fig:attndiff}
\end{figure}

\section{Introduction}
Large language models (LLMs) have emerged as general-purpose backbones for a wide range of real-world applications.
In practical deployment scenarios, a single pre-trained model is often required to simultaneously support multiple downstream tasks under strict constraints on storage, training cost, and inference efficiency.
These constraints make full-parameter fine-tuning impractical and have driven extensive research on parameter-efficient fine-tuning (PEFT) methods.

PEFT methods adapt LLMs by updating only a small subset of parameters while keeping most pre-trained weights frozen~\cite{han2024parameter}.
Among these, sparse fine-tuning has recently attracted attention due to its favorable deployment properties~\cite{shiracite}.
By directly applying sparse parameter updates to pre-trained weights, sparse fine-tuning avoids introducing additional modules or modifying the model architecture, thereby preventing extra inference latency.
In contrast, although LoRA~\cite{hu2021lora} and adapter-based approaches~\cite{houlsby2019parameter} reduce the number of trainable parameters, their additional modules introduce extra complexity for deployment and task switching in multi-task settings.

Despite their success, most existing PEFT and sparse fine-tuning methods are developed primarily for single-task adaptation.
When extended to multi-task scenarios, they encounter two fundamental challenges that remain insufficiently explored.
\textbf{(1) Task Sharing Challenge:} Existing methods either enforce full sharing of sparse structures across tasks, failing to capture task-specific variations, or learn entirely separate updates for each task, which leads to redundant parameters and inefficient resource allocation.
In addition, both strategies fail to model the partial and structured dependencies that commonly exist among tasks.
As a result, they struggle to balance cross-task knowledge sharing with the flexibility required for effective task-specific adaptation.
\textbf{(2) Resource Allocation Challenge:} Most existing methods allocate adaptation parameters independently for each task, often using uniform budget ratios or manually specified task constraints.
However, the lack of a global allocation mechanism prevents shared and task-specific parameters from jointly competing for limited resources.
This isolated allocation leads to either underutilization or over-allocation of parameters, ultimately resulting in suboptimal performance.

Empirical observations from multi-task fine-tuning further highlight the heterogeneous adaptation requirements across tasks.
As illustrated in Figure~\ref{fig:attndiff}, activation patterns exhibit significant task-dependent differences, indicating that certain layers and modules benefit from shared representations, while others require task-specific modifications.
These observations emphasize the key challenge of multi-task sparse fine-tuning: how to allocate limited adaptation capacity across tasks under a unified parameter budget.
The challenge extends beyond identifying which parameters to update, and critically involves determining how to balance shared and task-specific adaptations.
In this setting, each parameter group faces a discrete structural decision: it may remain frozen, be shared across tasks, or be specialized for a particular task.
These decisions are inherently interdependent: allocating more shared parameters can improve cross-task generalization but reduces the budget available for task-specific adaptation, whereas excessive task-specific updates introduce redundancy and inefficiency.
However, existing approaches typically rely on static or heuristic allocation strategies, which lack the flexibility to adaptively balance shared and task-specific structures based on task relationships and training signals.

To address these challenges, we formulate multi-task sparse fine-tuning as a structure allocation problem under a unified parameter budget.
The objective is to allocate sparse adaptation capacity within a fixed backbone, balancing shared knowledge across tasks with task-specific specialization, while adhering to global resource constraints.
Based on this, we propose \textbf{MESSA}, \textbf{M}ulti-task \textbf{E}fficient \textbf{S}hared-specific \textbf{S}parse \textbf{A}daptation, a shared-specific sparse fine-tuning framework for multi-task adaptation of LLMs.
MESSA decomposes the adaptation for each task into the sum of a globally shared sparse update and a task-specific sparse update, enabling flexible modeling of both common and task-dependent knowledge.
To determine how adaptation capacity should be allocated, MESSA introduces a budget-aware soft gating mechanism that induces structured sparsity during training.
After learning the soft structure, a one-shot global pruning step is applied to convert the soft gates into a fixed sparse model, ensuring no additional inference overhead while preserving performance.

By efficiently allocating sparse adaptation capacity across tasks and explicitly modeling shared and task-specific structures, MESSA significantly improves parameter efficiency and multi-task performance compared to existing PEFT methods.
Importantly, MESSA does not modify the backbone architecture or introduce auxiliary modules, making it well-suited for practical deployment.
The main contributions of this paper are summarized as follows:
\begin{itemize}[leftmargin=*, topsep=0pt]
\item We propose MESSA, a novel shared-specific sparse fine-tuning framework for multi-task adaptation of LLMs.
MESSA explicitly models both cross-task shared knowledge and task-specific adaptation within a unified parameter budget, enabling flexible knowledge sharing across tasks while maintaining task-specific specialization.

\item We formulate multi-task sparse fine-tuning as a structure allocation problem and introduce a budget-aware soft-to-hard structure learning approach.
This approach automatically allocates sparse adaptation capacity via soft gating and produces a deployable, performance-preserving sparse model through one-shot pruning.

\item Extensive experiments on diverse multi-task benchmarks demonstrate that MESSA outperforms existing PEFT baselines under identical parameter budgets, validating its effectiveness, efficiency, and scalability.
\end{itemize}

\begin{figure*}[ht]
\centering
\includegraphics[width=0.7\linewidth]{assets/model2.pdf}
\caption{MESSA framework with shared--specific sparse updates. Sparse structures are learned via budget-aware soft gating and overlap regularization, and hardened through a soft-to-hard training process under a unified parameter budget.}
\label{fig:framework}
\end{figure*}

\section{Preliminaries and Problem Setup}
\label{sec:pre}

In this section, we first review sparse parameter-efficient fine-tuning and then formally define the budget-constrained multi-task fine-tuning problem studied in this work.

\subsection{Sparse Parameter-Efficient Fine-Tuning}

Parameter-efficient fine-tuning (PEFT) aims to adapt a pre-trained LLM by updating only a small subset of parameters while keeping the backbone frozen.
Let $\mathcal{M}$ denote a model with parameters $\mathbf{W}$.
Sparse PEFT parameterizes task adaptation as a sparse update $\Delta^{(t)}$, yielding the adapted model for task $t$:
\begin{equation}
\mathcal{M}^{(t)} = \mathcal{M} + \Delta^{(t)},
\end{equation}
where $\Delta^{(t)}$ denotes the sparse task-specific adaptation for task $t$, \ie only a small fraction of its entries are non-zero.

In practice, sparse updates are typically applied in a structured manner, where parameters are selected and updated at the level of parameter groups (\eg rows or blocks of weight matrices), rather than individual scalar weights.
This selective update mechanism allows for efficient fine-tuning with minimal parameter overhead.
In contrast to low-rank methods such as LoRA~\cite{hu2021lora} and adapter-based PEFT methods~\cite{houlsby2019parameter}, which introduce additional modules to parameterize task adaptations, sparse PEFT directly modifies existing weights and preserves the original model architecture, avoiding additional inference overhead.
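
As a concrete illustration of such structured updates, the following PyTorch sketch (our own simplified example, not the released implementation) adapts a frozen linear layer with a row-sparse delta; the class name \texttt{RowSparseLinear} and the \texttt{active\_rows} argument are hypothetical.
\begin{verbatim}
import torch
import torch.nn as nn

# Sketch: a frozen linear layer adapted by a row-structured sparse delta,
# so the effective weight is W + Delta, with Delta non-zero on a few rows.
class RowSparseLinear(nn.Module):
    def __init__(self, base: nn.Linear, active_rows):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # backbone stays frozen
        self.register_buffer("active_rows", torch.as_tensor(active_rows))
        # one trainable delta row per selected output row; others stay zero
        self.delta_rows = nn.Parameter(
            torch.zeros(len(active_rows), base.in_features))

    def forward(self, x):
        delta = torch.zeros_like(self.base.weight)
        delta[self.active_rows] = self.delta_rows  # scatter sparse rows
        return nn.functional.linear(x, self.base.weight + delta,
                                    self.base.bias)

layer = RowSparseLinear(nn.Linear(16, 8), active_rows=[1, 5])
out = layer(torch.randn(4, 16))   # only output rows 1 and 5 are adapted
\end{verbatim}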

\subsection{Budget-Constrained Multi-Task Fine-Tuning}

We consider a multi-task learning setting with $T$ downstream tasks $\{\mathcal{T}_t\}_{t=1}^T$.
Under sparse PEFT, each task is adapted by a sparse update, and all task adaptations must jointly satisfy a unified global parameter budget.
Formally, we decompose the sparse adaptation for task $t$ into two components:
\begin{equation}
\Delta^{(t)} = \Delta_{\mathrm{sh}} + \Delta_{\mathrm{sp}}^{(t)},
\end{equation}
where $\Delta_{\mathrm{sh}}$ is a shared sparse update applied across all tasks, and $\Delta_{\mathrm{sp}}^{(t)}$ denotes a task-specific update unique to task $t$.

We assume that sparse updates are organized into structured parameter groups.
Let $\mathcal{G}$ denote the collection of all parameter groups, and $s_g$ represent the parameter cost associated with group $g \in \mathcal{G}$.
A parameter group is considered active if it is selected for updating in either the shared or task-specific component.
The total adaptation cost across all tasks is constrained by a unified budget $B$, formalized as:
\begin{equation}
\sum_{g \in \mathcal{G}} s_g \cdot \mathbb{I}[g \in \Delta_{\mathrm{sh}}]
+ \sum_{t=1}^T \sum_{g \in \mathcal{G}} s_g \cdot \mathbb{I}[g \in \Delta_{\mathrm{sp}}^{(t)}]
\le B,
\end{equation}
where $\mathbb{I}[\cdot]$ is an indicator function that takes the value 1 if the parameter group $g$ is activated in the update and 0 otherwise. Note that task-specific updates are counted separately for each task, as they correspond to distinct parameter tensors even when they operate on the same backbone group.

The objective of budget-constrained multi-task fine-tuning is to allocate limited adaptation capacity across shared and task-specific updates so that all tasks are effectively adapted while satisfying the global budget constraint.
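
For concreteness, a minimal sketch of this budget accounting is given below; \texttt{group\_cost}, \texttt{shared\_groups}, and \texttt{specific\_groups\_per\_task} are illustrative names for $s_g$ and the active shared and task-specific groups, and the numbers are made up.
\begin{verbatim}
# Sketch: discrete budget check for a given selection of active groups.
def total_adaptation_cost(group_cost, shared_groups,
                          specific_groups_per_task):
    cost = sum(group_cost[g] for g in shared_groups)
    for groups in specific_groups_per_task.values():
        cost += sum(group_cost[g] for g in groups)
    return cost

group_cost = {0: 64, 1: 64, 2: 64, 3: 64}   # s_g = d_in per row group
within_budget = total_adaptation_cost(
    group_cost,
    shared_groups={0, 2},
    specific_groups_per_task={"gsm8k": {1}, "medqa": {3}},
) <= 256                                     # unified budget B
\end{verbatim}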

\section{Method}
\label{sec:method}

In this section, we present the MESSA framework, which addresses the challenges of budget-constrained multi-task sparse fine-tuning. We begin by providing an overview of the framework, followed by a detailed description of its key components, and finally present the overall algorithm.

\subsection{Framework Overview}

In multi-task scenarios, existing PEFT methods exhibit two fundamental limitations:
(1) they ignore partial yet significant dependencies between tasks, resulting in redundant and inefficient resource allocation;
(2) they lack a global mechanism to balance shared and task-specific adaptation under a unified parameter budget, which prevents efficient allocation of adaptation capacity across tasks.

To address these challenges, we propose \textsc{MESSA} (Multi-Task Efficient Shared-Specific Sparse Adaptation), a framework that formulates multi-task sparse fine-tuning as a structured allocation problem.
The key insight is that parameter groups should be treated as decision units and explicitly assigned to remain frozen, be shared across tasks, or be specialized for individual tasks, while being optimized under a unified global budget constraint.
As illustrated in Figure~\ref{fig:framework}, \textsc{MESSA} allocates sparse adaptation capacity by decomposing each task's adaptation into shared and task-specific sparse updates using the proposed Shared-Specific Sparse Representation (SS-Sparse), organized into structured, row-wise parameter groups, thereby modeling both common and task-dependent knowledge.
A budget-aware soft gating mechanism guides this allocation, and after learning the soft structure, a one-shot pruning step converts it into a fixed, deployable sparse model.
By jointly balancing shared and task-specific adaptations, \textsc{MESSA} improves multi-task performance and parameter efficiency without introducing additional modules or inference latency, making it well suited for real-world multi-task deployment.

\subsection{Shared-Specific Sparse Representation}
\label{sec:ss_sparse}
Effective multi-task adaptation requires capturing both cross-task commonality and task-specific specialization.
Figure~\ref{fig:attndiff} reveals heterogeneous task-dependent activation patterns, indicating the need for both shared and task-specific adaptation.
Motivated by this, we introduce \textbf{Shared-Specific Sparse Representation (SS-Sparse)}, which decomposes each task's adaptation into shared and task-specific sparse components.
This representation provides an explicit and structured foundation for modeling task relatedness and specialization.

\subsubsection{Multi-Task Shared-Specific Delta Decomposition}
We model multi-task sparse fine-tuning by decomposing the adaptation for each task into a shared component and a task-specific component. Formally, given a frozen backbone model $\mathcal{M}$ with parameters $\mathbf{W}$, the adapted model for task $t$ is defined by modifying the frozen parameters as
\begin{equation}
\mathcal{M}^{(t)} = \mathcal{M} + \Delta^{(t)},
\label{eq:task_model}
\end{equation}
where the task-specific adaptation $\Delta^{(t)}$ is decomposed as
\begin{equation}
\Delta^{(t)} = \Delta_{\mathrm{sh}} + \Delta_{\mathrm{sp}}^{(t)}.
\label{eq:ss_decomp}
\end{equation}

Here, $\Delta_{\mathrm{sh}}$ denotes a sparse update shared across tasks, capturing cross-task commonality, while $\Delta_{\mathrm{sp}}^{(t)}$ represents a task-specific update, modeling task-dependent variations.
This decomposition explicitly separates shared and specialized adaptation capacity within a single sparse update formulation.

This decomposition offers two key advantages.
First, it allows related tasks to reuse a common set of sparse updates, reducing parameter redundancy and improving parameter efficiency.
Second, it preserves sufficient flexibility for task-specific adaptation, avoiding the restrictive assumption of complete sharing.
Compared to approaches that enforce either fully shared or fully independent adaptations, the shared-specific decomposition in Eq.~\ref{eq:ss_decomp} provides a more expressive and balanced formulation for multi-task sparse fine-tuning.

\subsubsection{Row-Wise Structured Parameter Groups}

To enable structured sparsity and efficient allocation of the shared and task-specific updates, we organize sparse updates into parameter groups.
In this work, we adopt a row-wise grouping strategy for linear layers.
Specifically, for a linear transformation with weight matrix $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, each output row is treated as a distinct parameter group, serving as a basic decision unit.
Let $\mathcal{G}$ denote the set of all parameter groups, and let $g \in \mathcal{G}$ index a group corresponding to one output row. The parameter cost of each group is defined as
\begin{equation}
s_g = d_{\text{in}},
\label{eq:row_cost}
\end{equation}
reflecting the number of parameters associated with that row.

Row-wise grouping provides a favorable balance between flexibility and structure.
Compared to element-wise sparsity, it significantly reduces the number of structural decisions and yields contiguous parameter blocks that are easy to prune and deploy.
Compared to coarser groupings such as entire layers, it enables fine-grained allocation of adaptation capacity.
Moreover, in Transformer-based models, row-wise groups naturally align with output neurons and attention projections, making them suitable units for selective adaptation.
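
A small sketch of how such row-wise groups and their costs could be enumerated for attention projection layers is shown below; the helper name \texttt{build\_row\_groups} and the module-name filter are assumptions for illustration only.
\begin{verbatim}
import torch.nn as nn

# Sketch: enumerate row-wise parameter groups and their costs (s_g = d_in)
# for the attention projection layers of a Transformer model.
def build_row_groups(model,
                     keywords=("q_proj", "k_proj", "v_proj", "o_proj")):
    groups = []   # entries: (module_name, row_index, cost)
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and any(k in name for k in keywords):
            d_out, d_in = module.weight.shape
            groups.extend((name, row, d_in) for row in range(d_out))
    return groups
\end{verbatim}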

\subsubsection{Group-Level Soft Gating}
To enable differentiable structural allocation over parameter groups, we associate each parameter group with learnable soft gates.
For each group $g \in \mathcal{G}$, we introduce a shared gate $z^{\mathrm{sh}}_g \in (0,1)$ and task-specific gates $z^{\mathrm{sp}}_{g,t} \in (0,1)$, which modulate the contributions of the shared and task-specific components, respectively.
Under this group-wise representation, the shared and task-specific sparse updates can be expressed as
\begin{equation}
\Delta^{(t)} = \sum_{g \in \mathcal{G}} \left(
z^{\mathrm{sh}}_g \cdot \Delta^{\mathrm{sh}}_g +
z^{\mathrm{sp}}_{g,t} \cdot \Delta^{\mathrm{sp}}_{g,t}
\right),
\label{eq:gated_delta}
\end{equation}
where $\Delta^{\mathrm{sh}}_g$ and $\Delta^{\mathrm{sp}}_{g,t}$ denote the parameters associated with group $g$ in the shared and task-specific updates.

The soft gates serve as continuous allocation weights over parameter groups and act as differentiable proxies for discrete structural decisions.
During training, a parameter group can simultaneously participate in both shared and task-specific updates, allowing the model to explore different degrees of sharing across tasks.
This design enables gradient-based optimization of both parameter values and structure-related decision variables, and provides a continuous foundation for subsequent structure regularization and soft-to-hard selection.
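
A minimal PyTorch sketch of Eq.~\ref{eq:gated_delta} for a single linear layer is given below; parameterizing the gates as sigmoids over free logits is our own simplification, and all identifiers are illustrative rather than taken from the released code.
\begin{verbatim}
import torch
import torch.nn as nn

# Sketch: soft gates scale the shared and task-specific delta rows of one layer.
class GatedSSDelta(nn.Module):
    def __init__(self, d_out, d_in, num_tasks):
        super().__init__()
        self.shared_delta = nn.Parameter(torch.zeros(d_out, d_in))
        self.specific_delta = nn.Parameter(torch.zeros(num_tasks, d_out, d_in))
        self.shared_logit = nn.Parameter(torch.zeros(d_out))   # one gate per row
        self.specific_logit = nn.Parameter(torch.zeros(num_tasks, d_out))

    def forward(self, task_id):
        z_sh = torch.sigmoid(self.shared_logit).unsqueeze(-1)  # (d_out, 1)
        z_sp = torch.sigmoid(self.specific_logit[task_id]).unsqueeze(-1)
        return z_sh * self.shared_delta + z_sp * self.specific_delta[task_id]

delta_t = GatedSSDelta(d_out=8, d_in=16, num_tasks=5)(task_id=2)  # Delta for task 2
\end{verbatim}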

\subsubsection{Shared-Specific Overlap Regularization}

While the shared-specific decomposition provides flexibility, excessive simultaneous activation of both shared and task-specific components may lead to redundant adaptation and unclear structural separation.
To mitigate this issue, we introduce a shared-specific overlap regularization that penalizes concurrent activation of shared and task-specific gates. Specifically, we define the overlap regularization term as
\begin{equation}
\mathcal{L}_{\text{overlap}} = \sum_{t=1}^T \sum_{g \in \mathcal{G}} z^{\mathrm{sh}}_g \cdot z^{\mathrm{sp}}_{g,t},
\label{eq:overlap}
\end{equation}
which assigns a higher penalty when both the shared gate $z^{\mathrm{sh}}_g$ and the task-specific gate $z^{\mathrm{sp}}_{g,t}$ are simultaneously active.
This regularization encourages each parameter group to be primarily assigned to either shared or task-specific adaptation, while preserving flexibility.
By promoting clearer structural separation, it reduces redundant updates and improves the efficiency of sparse adaptation under a global budget.

While the overlap regularization guides soft structural allocation during training, the final sparse structure must satisfy a global budget constraint and be converted into a discrete, deployable form, which we describe next.
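
The penalty in Eq.~\ref{eq:overlap} reduces to an elementwise product between the gate tensors, as in the following sketch (assuming shared gates of shape \texttt{(num\_groups,)} and task-specific gates of shape \texttt{(num\_tasks, num\_groups)}).
\begin{verbatim}
import torch

# Sketch: overlap penalty between shared and task-specific gates.
def overlap_loss(z_shared, z_specific):
    return (z_shared.unsqueeze(0) * z_specific).sum()

penalty = overlap_loss(torch.sigmoid(torch.randn(100)),
                       torch.sigmoid(torch.randn(5, 100)))
\end{verbatim}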

\subsection{Soft-to-Hard Structure Learning}
\label{sec:soft_to_hard}

Building on the shared-specific sparse representation introduced in Section~\ref{sec:ss_sparse}, we describe how the sparse structure is learned and fixed under a unified budget to produce a deployable sparse model.
To reduce the search space, we construct a candidate pool whose size is a small multiple of the target budget, selected according to the row-wise weight norm of the pretrained model with minor random inclusion for diversity.
We then adopt a soft-to-hard structure learning strategy that hardens the learned soft structural preferences into a fixed sparse structure via one-shot pruning.

\paratitle{Warmup Phase.}
At the beginning of training, sparse adaptation parameters and structural gates are not yet informative.
To avoid unstable allocation decisions, we introduce a warmup phase, applied for an initial period of training, in which the gating variables are frozen and only the sparse adaptation parameters within the candidate pool are optimized.
During this phase, training minimizes the task loss:
\begin{equation}
\mathcal{L}_{\text{warmup}} = \mathcal{L}_{\text{task}}.
\label{eq:warm_loss}
\end{equation}

This warmup allows sparse updates to learn meaningful task-related representations, providing a stable initialization for subsequent budget-aware structure learning.

\paratitle{Budget-Aware Soft Learning.}
After the warmup phase, we jointly optimize the sparse adaptation parameters and structural gates under a unified budget.
At this stage, the soft gates act as continuous allocation variables, enabling differentiable structure learning.
To incorporate the budget, we define the expected adaptation cost associated with the soft gates as
\begin{equation}
\mathcal{C}_{\text{soft}} =
\sum_{g \in \mathcal{G}} s_g \cdot z^{\mathrm{sh}}_g
+ \sum_{t=1}^T \sum_{g \in \mathcal{G}} s_g \cdot z^{\mathrm{sp}}_{g,t},
\label{eq:soft_cost}
\end{equation}
where $s_g$ denotes the parameter cost of group $g$ defined in Eq.~\ref{eq:row_cost}.
This soft cost represents the expected number of activated parameters under the soft gates, and serves as a differentiable approximation to the discrete budget constraint.

We enforce the budget by penalizing violations of the target budget $B$ through a regularization term:
\begin{equation}
\mathcal{L}_{\text{budget}} =
\max \left( 0,\ \mathcal{C}_{\text{soft}} - B \right),
\label{eq:budget_loss}
\end{equation}
which softly discourages the expected adaptation cost from exceeding the target budget.
During this phase, the overall training objective is given by
\begin{equation}
\mathcal{L}_{\text{soft}} =
\mathcal{L}_{\text{task}}
+ \mathcal{L}_{\text{budget}}
+ \lambda_{\text{overlap}} \mathcal{L}_{\text{overlap}},
\label{eq:soft_objective}
\end{equation}
where $\mathcal{L}_{\text{overlap}}$ is the shared-specific overlap regularization defined in Eq.~\ref{eq:overlap}.
This objective jointly balances task performance, structural sparsity, and shared-specific separation.
The resulting soft structural preferences provide the basis for deriving a discrete, budget-satisfying sparse structure.
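
The sketch below illustrates how Eqs.~\ref{eq:soft_cost}--\ref{eq:soft_objective} combine; \texttt{group\_cost} plays the role of $s_g$, and the function name and default weight are assumptions rather than the paper's implementation.
\begin{verbatim}
import torch

# Sketch: expected soft cost and the budget-aware training objective.
def soft_objective(task_loss, z_shared, z_specific, group_cost, budget,
                   lambda_overlap=1.0):
    soft_cost = ((group_cost * z_shared).sum()
                 + (group_cost.unsqueeze(0) * z_specific).sum())
    budget_loss = torch.clamp(soft_cost - budget, min=0.0)  # hinge on violation
    overlap = (z_shared.unsqueeze(0) * z_specific).sum()
    return task_loss + budget_loss + lambda_overlap * overlap
\end{verbatim}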

\paratitle{One-Shot Hard Selection.}
After budget-aware soft learning, we convert the learned soft structure into a fixed and deployable sparse structure via a \emph{one-shot hard selection} procedure, in which discrete structural decisions are made once.
Specifically, parameter groups are ranked according to their learned gate values (\ie $z^{\mathrm{sh}}_g$ for shared updates and $z^{\mathrm{sp}}_{g,t}$ for task-specific ones), and groups with higher scores are selected first until the global budget constraint is satisfied.

All non-selected groups are pruned by setting their updates to zero, while the selected sparse updates are fixed for inference.
As a result, the final model has a fixed sparse structure and introduces no additional overhead at inference time.
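
A minimal sketch of this selection step is shown below: candidate groups (shared and task-specific) are ranked by their gate values and greedily kept while they still fit within the budget; the keys and numbers are illustrative.
\begin{verbatim}
# Sketch: greedily keep the highest-scoring groups that fit the budget B.
def hard_select(gate_scores, group_costs, budget):
    selected, spent = set(), 0
    for key in sorted(gate_scores, key=gate_scores.get, reverse=True):
        if spent + group_costs[key] <= budget:
            selected.add(key)         # kept; all other groups are pruned to zero
            spent += group_costs[key]
    return selected

keep = hard_select({("sh", 0): 0.9, ("sp", 1, 0): 0.4, ("sh", 1): 0.7},
                   {("sh", 0): 64, ("sp", 1, 0): 64, ("sh", 1): 64},
                   budget=128)        # -> {("sh", 0), ("sh", 1)}
\end{verbatim}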

\subsection{Overall Algorithm}
\label{sec:algorithm}

MESSA jointly integrates Shared-Specific Sparse Representation (SS-Sparse) and soft-to-hard structure learning to allocate sparse adaptation capacity across multiple tasks under a global budget and produce a fixed, deployable sparse model.

Algorithm~\ref{alg:messa} summarizes the overall training procedure of \textsc{MESSA}.
The method follows a soft-to-hard structure learning paradigm for budget-constrained multi-task adaptation.
Specifically, \textsc{MESSA} starts with a warmup stage that stabilizes sparse adaptation parameters while keeping structural gates frozen.
It then performs budget-aware soft learning, jointly optimizing sparse parameters and soft gates to induce shared and task-specific structures under the global budget.
Finally, a one-shot hard selection step converts the learned soft structure into a discrete sparse structure that strictly satisfies the budget, yielding a fixed and deployable model with no additional inference overhead.

\begin{algorithm}[t]
\small
\caption{\textsc{MESSA}: Soft-to-Hard Multi-Task Sparse Fine-Tuning}
\label{alg:messa}
\KwIn{Frozen backbone model $\mathcal{M}$; tasks $\{\mathcal{T}_t\}_{t=1}^T$; global budget $B$; training steps $S$}
\KwOut{Fixed shared sparse update $\Delta_{\mathrm{sh}}$; fixed task-specific sparse updates $\{\Delta_{\mathrm{sp}}^{(t)}\}_{t=1}^T$}

1. Initialize shared and task-specific sparse updates $\Delta_{\mathrm{sh}}, \Delta_{\mathrm{sp}}^{(t)} \leftarrow \mathbf{0}$ for all $t$\;
2. Initialize soft gates for all parameter groups\;
3. Construct candidate pool $\mathcal{C}$ based on row-wise weight norm\;
4. Set warmup steps $S_{\mathrm{warmup}}$ and pruning step $S_{\mathrm{prune}}$\;

\For{$s = 1$ \KwTo $S$}{
    Sample a task $t$ and a mini-batch from $\mathcal{T}_t$\;
    \If{$s \le S_{\mathrm{warmup}}$}{
        Freeze all soft gates\;
        Update $\Delta_{\mathrm{sh}}$ and $\Delta_{\mathrm{sp}}^{(t)}$ within candidate pool $\mathcal{C}$ using task loss $\mathcal{L}_{\text{task}}$ (Eq.~\ref{eq:warm_loss})\;
    }
    \Else{
        Compute SS-Sparse gated updates using soft gates (Eq.~\ref{eq:gated_delta})\;
        Optimize sparse updates and soft gates using the budget-aware objective $\mathcal{L}_{\text{soft}}$ (Eq.~\ref{eq:soft_objective})\;
    }
    \If{$s = S_{\mathrm{prune}}$}{
        Rank parameter groups by soft gate values\;
        Select shared and task-specific groups under budget $B$\;
        Convert soft gates to binary masks and prune unselected groups\;
        Fix the sparse structure for the remaining training steps\;
    }
}

\Return{$\Delta_{\mathrm{sh}}, \{\Delta_{\mathrm{sp}}^{(t)}\}_{t=1}^T$}
\end{algorithm}

\begin{table*}[t]
\small
\centering
\caption{Overall multi-task performance of different PEFT methods across backbone LLMs under a comparable parameter budget.
Avg, Geo, and Worst denote Macro Average, Geometric Mean, and Worst-Task, with bold and underlined values indicating the best and second-best results, $^{*}$ marking statistically significant improvements over the best baseline ($p<0.05$), and Param (\%) reporting the trainable parameter ratio.}
\label{tab:exp1}
\resizebox{0.95\linewidth}{!}{
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l|c|ccc|ccc|ccc}
\toprule
\multirow{2}{*}{Method}
& \multirow{2}{*}{\makecell{Avg.Param \\(\%)}}
& \multicolumn{3}{c|}{Qwen 3 4B}
& \multicolumn{3}{c|}{LLaMA 3.2 3B}
& \multicolumn{3}{c}{Gemma 3 4B} \\
\cmidrule(lr){3-5} \cmidrule(lr){6-8} \cmidrule(lr){9-11}
& & Avg$\uparrow$ & Geo$\uparrow$ & Worst$\uparrow$
& Avg$\uparrow$ & Geo$\uparrow$ & Worst$\uparrow$
& Avg$\uparrow$ & Geo$\uparrow$ & Worst$\uparrow$ \\
\midrule
LoRA (shared) & 2.25 & 76.47 & 75.56 & 59.81 & 67.05 & 65.99 & 53.03 & 71.22 & 69.53 & 50.08 \\
LoRA (specific) & 2.25 & \underline{76.66} & \underline{75.76} & 60.75 & 64.70 & 63.29 & 52.75 & \underline{71.86} & \underline{70.09} & 49.45 \\
AdaLoRA (shared) & 2.50 & 74.82 & 73.94 & 58.24 & 63.02 & 62.10 & 51.18 & 65.39 & 62.85 & 42.27 \\
AdaLoRA (specific) & 2.50 & 75.45 & 74.61 & 59.18 & 62.94 & 61.99 & 53.03 & 66.57 & 64.00 & 43.33 \\
\midrule
SHiRA (shared) & 2.26 & 74.60 & 73.51 & 56.99 & 70.35 & 69.40 & 53.06 & 67.99 & 65.64 & 44.27 \\
SHiRA (specific) & 2.26 & 76.62 & 75.67 & \underline{62.64} & 66.94 & 65.62 & 51.33 & 71.26 & 69.52 & \underline{50.86} \\
\midrule
MTLoRA & 2.70 & \underline{76.81} & \underline{75.98} & 62.01 & \underline{71.95} & \underline{71.29} & \underline{58.08} & 71.60 & 69.84 & 50.24 \\
MOELoRA & 2.26 & 76.07 & 75.27 & 60.91 & 70.96 & 70.24 & 55.42 & 70.52 & 68.78 & 48.67 \\
MESSA (ours) & 1.86 & \textbf{78.01}$^{*}$ & \textbf{77.18}$^{*}$ & \textbf{62.79}$^{*}$ & \textbf{72.96}$^{*}$ & \textbf{72.42}$^{*}$ & \textbf{59.50}$^{*}$ & \textbf{72.40}$^{*}$ & \textbf{70.63}$^{*}$ & \textbf{51.33}$^{*}$ \\
\bottomrule
\end{tabular}
}
\end{table*}

\begin{table}[t]
\small
\centering
\caption{Scalability results of different PEFT methods across Qwen3 backbones with different model sizes.}
\label{tab:scale}
\resizebox{1\linewidth}{!}{
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{l|cc|cc|cc}
\toprule
Backbone LLM
& \multicolumn{2}{c|}{Qwen 3 0.6B}
& \multicolumn{2}{c|}{Qwen 3 1.7B}
& \multicolumn{2}{c}{Qwen 3 4B} \\
\midrule
Metric & Avg$\uparrow$ & Geo$\uparrow$ & Avg$\uparrow$ & Geo$\uparrow$ & Avg$\uparrow$ & Geo$\uparrow$ \\
\midrule
LoRA (shared) & 58.97 & 55.91 & 69.75 & 68.42 & 76.47 & 75.56 \\
LoRA (specific) & 60.66 & 58.12 & 69.67 & 68.30 & 76.66 & 75.76 \\
SHiRA (shared) & 56.59 & 53.21 & 68.47 & 66.83 & 74.60 & 73.51 \\
SHiRA (specific) & 60.74 & 57.64 & \underline{70.96} & \underline{69.76} & 76.62 & 75.67 \\
MTLoRA & \underline{61.13} & \underline{58.39} & 70.05 & 68.61 & \underline{76.81} & \underline{75.98} \\
\textbf{MESSA (ours)} & \textbf{61.77} & \textbf{58.65} & \textbf{71.93} & \textbf{70.18} & \textbf{78.01} & \textbf{77.18} \\
\bottomrule
\end{tabular}
}
\end{table}

\begin{table*}[t]
\centering
\footnotesize
\caption{Per-task performance on the five evaluation datasets using the Qwen3-4B backbone.
BoolQ, MedQA, and HellaSwag are evaluated by accuracy (Acc), GSM8K is evaluated by exact match (EM), and CodeAlpaca is evaluated by instruction compliance rate (ICR).
Avg denotes the macro average across all tasks, while Geo denotes the geometric mean.
Higher values indicate better performance.}
\label{tab:crosstaskresult}
\resizebox{0.8\linewidth}{!}{
\renewcommand{\arraystretch}{0.94}
\begin{tabular}{l|ccccc|cc}
\toprule
Dataset & BoolQ & CodeAlpaca & MedQA & GSM8K & HellaSwag & \multirow{2}{*}{Avg} & \multirow{2}{*}{Geo} \\
\cmidrule(lr){1-6}
Metric & Acc $\uparrow$ & ICR $\uparrow$ & Acc $\uparrow$ & EM $\uparrow$ & Acc $\uparrow$ & & \\
\midrule
LoRA (shared) & 86.79 & \underline{67.45} & 59.81 & 77.27 & 91.02 & 76.47 & 75.56 \\
LoRA (specific) & \underline{87.89} & 67.40 & 60.75 & 76.06 & 91.20 & 76.66 & 75.76 \\
AdaLoRA (shared) & 85.81 & 66.55 & 58.24 & 75.61 & 87.89 & 74.82 & 73.94 \\
AdaLoRA (specific) & 85.02 & 66.75 & 59.18 & \underline{77.42} & 88.89 & 75.45 & 74.61 \\
\midrule
SHiRA (shared) & 86.79 & 64.65 & 56.99 & 74.85 & 89.70 & 74.60 & 73.51 \\
SHiRA (specific) & 87.40 & 63.50 & \underline{62.64} & 77.73 & \underline{91.83} & \underline{76.62} & \underline{75.67} \\
\midrule
MTLoRA & 86.42 & 66.35 & 62.01 & \underline{78.33} & 90.92 & \underline{76.81} & \underline{75.98} \\
MOELoRA & 86.24 & \underline{67.65} & 60.91 & 75.61 & 89.92 & 76.07 & 75.27 \\
\textbf{MESSA (ours)} & \textbf{88.07} & \textbf{68.30} & \textbf{62.79} & \textbf{78.33} & \textbf{92.57} & \textbf{78.01} & \textbf{77.18} \\
\bottomrule
\end{tabular}
}
\end{table*}

\section{Experiments}
To comprehensively evaluate the performance of our proposed MESSA, we conduct extensive experiments guided by the following key research questions (RQs):

\begin{itemize}[leftmargin=*]
\item \textbf{RQ1:} Does MESSA improve multi-task performance under a similar global budget compared to strong PEFT baselines?
\item \textbf{RQ2:} How does MESSA scale with backbone LLMs of different parameter sizes?
\item \textbf{RQ3:} How do different components of MESSA contribute to its effectiveness under a unified budget?
\item \textbf{RQ4:} What structural allocation patterns does MESSA learn across attention modules in multi-task adaptation?
\end{itemize}

We first introduce the experimental setup and then systematically address each of the above research questions.

\subsection{Experimental Setup}
\paragraph{Datasets.}
We evaluate our method on five diverse tasks including BoolQ~\cite{clark2019boolq}, CodeAlpaca~\cite{codealpaca}, MedQA~\cite{jin2020disease}, GSM8K~\cite{cobbe2021gsm8k}, and HellaSwag~\cite{zellers2019hellaswag}, which cover heterogeneous reasoning and generation scenarios for evaluating multi-task adaptation.
For each task, we use its standard primary evaluation metric (Accuracy for classification-based reasoning tasks, Exact Match for GSM8K, and Instruction Compliance Rate for CodeAlpaca) and report three aggregated metrics, including Macro Average, Geometric Mean, and Worst-Task, to reflect average performance, balance, and robustness.
Further details are provided in the Appendix.

\paragraph{Backbone Models.}
We conduct experiments on multiple pre-trained LLM backbones including \textbf{Qwen 3}~\cite{qwen3technicalreport}, \textbf{LLaMA 3.2}~\cite{grattafiori2024llama}, and \textbf{Gemma 3}~\cite{gemma_2025} to evaluate performance and scalability.

\paragraph{Baseline Methods.}
We compare our method with representative PEFT approaches from three categories.
\textbf{Low-rank PEFT} baselines include LoRA~\cite{hu2021lora} and AdaLoRA~\cite{zhang2023adalora}.
\textbf{Sparse PEFT} baselines include SHiRA~\cite{shiracite}.
For these task-agnostic methods, we evaluate both \emph{task-specific} and \emph{shared} settings, where \emph{task-specific} assigns each task its own individual PEFT module, while \emph{shared} uses a single PEFT module shared across all tasks.
In addition, MTLoRA~\cite{agiza2024mtlora} and MOELoRA~\cite{liu2024moe} are included as \textbf{multi-task-oriented PEFT} baselines.
All methods are compared under matched parameter budgets for fairness.

\paragraph{Implementation Details.}
All experiments are conducted on NVIDIA GeForce RTX 4090 GPUs using PyTorch and HuggingFace Transformers.
We use the AdamW optimizer with a learning rate of 1e-4.
MESSA is applied to attention layers under a $2.5\%$ parameter budget, with a gate warmup ratio of $5\%$ and pruning at $15\%$ of training. Further details are provided in the Appendix and our code\footnote{\codelink}.

\subsection{Overall Multi-Task Performance (RQ1)}
\label{sec:rq1}

We first compare MESSA with strong PEFT baselines to evaluate overall multi-task effectiveness.
Table~\ref{tab:exp1} reports the overall multi-task performance across three backbone LLMs under a unified parameter budget.
MESSA consistently achieves the best results on all backbones, while using fewer trainable parameters than all baselines.
Single-task-oriented PEFT methods, such as LoRA, AdaLoRA, and SHiRA, are not designed for budget-constrained multi-task adaptation.
When extended to multi-task settings, they either enforce fully shared adaptations across tasks or allocate independent modules for each task.
The former lacks the flexibility to capture task-specific variations, while the latter leads to inefficient parameter usage and suboptimal budget allocation when multiple tasks compete for limited adaptation capacity.
While multi-task PEFT approaches such as MTLoRA and MOELoRA explicitly consider multiple tasks through routing or mixture mechanisms, they typically rely on heuristic or task-agnostic parameter allocation and do not model global budget competition at the structural level, preventing shared and task-specific parameters from being jointly optimized under a unified constraint.

To further understand how these overall gains are distributed across individual tasks, Table~\ref{tab:crosstaskresult} reports the per-task performance on the Qwen 3 4B backbone.
MESSA improves performance on all five tasks, suggesting that it effectively balances shared and task-specific adaptations, enabling improvements across heterogeneous tasks rather than overfitting to a subset of them.

\subsection{Scalability across Backbone Sizes (RQ2)}
\label{sec:rq2}

We next examine how MESSA scales with backbone LLMs of different parameter sizes.
Table~\ref{tab:scale} reports results on Qwen 3 backbones ranging from 0.6B to 4B parameters.
Across all model sizes, MESSA consistently achieves the best overall performance, indicating that its advantages are not limited to a specific model scale.
Notably, the performance gains of MESSA remain stable as the backbone size increases, demonstrating that MESSA scales robustly with model size and can effectively exploit different backbones while maintaining parameter efficiency under a unified budget.

\begin{figure}[t]
\centering
\includegraphics[width=1\linewidth]{assets/combined_ablation_module.pdf}
\caption{(a) Ablation study of MESSA, showing the impact of different components on overall multi-task performance.
(b) Selection rates of shared and task-specific updates across attention modules.}
\label{fig:analysis}
\end{figure}

\subsection{Ablation and Structural Analysis (RQ3, 4)}
\label{sec:analysis}
Figure~\ref{fig:analysis}(a) presents an ablation study under the same unified budget.
Removing any core component of MESSA leads to a consistent performance drop, indicating that its effectiveness relies on the joint design.
In particular, gate warmup and soft-to-hard structure learning are important for discovering stable sparse structures, while overlap control between shared and task-specific updates helps avoid redundant parameter allocation under a global budget.

Figure~\ref{fig:analysis}(b) illustrates the learned structural allocation across attention modules.
Shared sparse updates are more frequently selected in the key projection, which can be intuitively attributed to its role in defining task-agnostic attention compatibility and thus serving as a natural target for shared adaptation under a unified budget.

\section{Related Work}

\paragraph{Parameter-Efficient Fine-Tuning (PEFT).}
PEFT methods adapt large language models by updating only a small subset of parameters while keeping the backbone weights frozen.
Early PEFT methods achieve parameter efficiency by introducing lightweight task-specific adaptation modules, including adapters, continuous prompts, and low-rank reparameterizations such as LoRA~\cite{pfeiffer2020adapterhub,li2021prefix,hu2021lora}.
Although effective, these approaches rely on auxiliary components, leading to architectural modifications and additional complexity, especially in multi-task settings.
More recently, sparse fine-tuning has emerged as an alternative PEFT paradigm that directly learns sparse updates in the original weight space~\cite{sanh2020movement,ansell2024scaling,shiracite}, modifying only a small subset of backbone parameters and thereby avoiding auxiliary modules and additional inference overhead.
Nevertheless, prior work primarily focuses on single-task settings and lacks explicit mechanisms for budget-aware sparse allocation in multi-task adaptation.

\paragraph{Multi-Task Adaptation for LLMs.}
Multi-task adaptation enables a single model to support multiple tasks simultaneously.
A common strategy extends PEFT methods to multi-task scenarios by introducing task-specific adaptation components, such as fusing adapters across tasks or jointly training multiple lightweight modules~\cite{pfeiffer2020adapterfusion,mao2022unipelt,sheng2023s}.
Another line of work incorporates routing or mixture-of-experts mechanisms, where multiple adaptation modules (e.g., LoRA or adapter experts) are dynamically selected or weighted for different tasks or inputs~\cite{agiza2024mtlora,liu2024moe}.
While effective at modeling task diversity, these methods rely on auxiliary modules or routing mechanisms and typically allocate adaptation capacity in a per-task or heuristic manner, without explicitly modeling global competition under a unified parameter budget.
These limitations motivate a budget-aware multi-task adaptation approach that can jointly optimize shared and task-specific structures without introducing auxiliary modules.

\section{Conclusion}
We propose MESSA, a shared-specific sparse fine-tuning framework for budget-constrained multi-task adaptation of LLMs.
MESSA formulates multi-task sparse fine-tuning as a structure allocation problem under a unified parameter budget and addresses it through a shared-specific decomposition coupled with a budget-aware soft-to-hard structure learning strategy.
By jointly learning shared and task-specific sparse structures and hardening them into a fixed, deployable model, MESSA achieves strong multi-task performance without modifying the backbone architecture or introducing inference overhead.
Extensive experiments across diverse tasks and backbones demonstrate that MESSA consistently outperforms existing PEFT methods under identical parameter budgets, achieving superior overall performance, better task balance, and improved robustness.
These results highlight the importance of budget-aware structural allocation for effective multi-task adaptation and suggest a promising direction for scalable and deployable sparse fine-tuning of LLMs.

\appendix
\section{Evaluation Protocol}

In this work, we evaluate multi-task performance using task-specific primary metrics and report three aggregated metrics to reflect overall performance, balance, and robustness across tasks.

\subsection{Task-Specific Evaluation Metrics}

For each task, we adopt its standard evaluation metric following prior work:
\begin{itemize}
\item \textbf{BoolQ~\cite{clark2019boolq}, MedQA~\cite{jin2020disease}, HellaSwag~\cite{zellers2019hellaswag}}: Accuracy (Acc), defined as the proportion of correctly predicted answers.
\item \textbf{GSM8K~\cite{cobbe2021gsm8k}}: Exact Match (EM), which measures the percentage of predictions that exactly match the ground-truth numerical answer after normalization.
\item \textbf{CodeAlpaca~\cite{codealpaca}}: Instruction Compliance Rate (ICR), which measures the proportion of model outputs that successfully follow the instruction and produce a valid code response according to task-specific compliance rules.
\end{itemize}

All metrics are computed independently for each task on their respective test sets.

\subsection{Metric Normalization}

All primary metrics naturally lie in the range $[0,1]$, and therefore no additional rescaling or normalization is applied prior to aggregation.

\subsection{Aggregated Multi-Task Metrics}

Let $s_t \in [0,1]$ denote the primary evaluation score for task $t$, and let $T$ be the total number of tasks.

We report the following three aggregated metrics:
\begin{itemize}
\item \textbf{Macro Average}:
\begin{equation}
\text{MacroAvg} = \frac{1}{T} \sum_{t=1}^{T} s_t,
\end{equation}
which reflects the overall average performance across tasks.

\item \textbf{Geometric Mean}:
\begin{equation}
\text{GeoMean} = \exp\left( \frac{1}{T} \sum_{t=1}^{T} \log s_t \right),
\end{equation}
which emphasizes balanced performance and penalizes large disparities across tasks.

\item \textbf{Worst-Task Performance}:
\begin{equation}
\text{Worst} = \min_{t \in \{1,\dots,T\}} s_t,
\end{equation}
which measures robustness by capturing the weakest-task performance.
\end{itemize}
Note that aggregated evaluation metrics are used for reporting and are not involved in model selection or early stopping.
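
For reference, the three aggregated metrics can be computed from the per-task scores $s_t$ as in the following sketch; the task names and score values are illustrative only.
\begin{verbatim}
import math

# Sketch: aggregated metrics over per-task primary scores s_t in [0, 1].
def aggregate(scores):
    vals = list(scores.values())
    return {
        "macro_avg": sum(vals) / len(vals),
        "geo_mean": math.exp(sum(math.log(v) for v in vals) / len(vals)),
        "worst": min(vals),
    }

print(aggregate({"boolq": 0.88, "codealpaca": 0.68, "medqa": 0.63,
                 "gsm8k": 0.78, "hellaswag": 0.93}))
\end{verbatim}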

\section{Experimental Setup}

\subsection{Tasks and Datasets}

We evaluate all methods on five diverse tasks:
BoolQ (reading comprehension),
CodeAlpaca (code generation),
MedQA (medical question answering),
GSM8K (mathematical reasoning),
and HellaSwag (commonsense reasoning).
A unified prompt format is used within each task, and the maximum sequence length is set to 2000 tokens for all experiments.

\subsection{Data Splits and Reproducibility}

For datasets with predefined validation sets (BoolQ and HellaSwag), we split the validation set evenly into development and test subsets.
For datasets that only provide a test split (MedQA and GSM8K), we similarly split the test set into development and test subsets with a 1:1 ratio.
For CodeAlpaca, which contains only a training split, we partition the data into train, development, and test sets using a 7:2:1 ratio.
All dataset splits are created using a fixed random seed (42) to ensure reproducibility.
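
A minimal sketch of such a deterministic 1:1 development/test split is shown below; the helper name \texttt{split\_dev\_test} is our own and only the fixed seed (42) is taken from the setup above.
\begin{verbatim}
import random

# Sketch: a reproducible 1:1 development/test split with seed 42.
def split_dev_test(examples, seed=42):
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    mid = len(indices) // 2
    dev = [examples[i] for i in indices[:mid]]
    test = [examples[i] for i in indices[mid:]]
    return dev, test
\end{verbatim}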

\subsection{Task Sampling}

We adopt an epoch-based mixed task sampling strategy.
At each epoch, mini-batches from all tasks are constructed independently and then shuffled into a single global sequence.
Each mini-batch contains samples from only one task, enabling task-specific gating, while the randomized batch order ensures balanced multi-task optimization.
All training samples are visited exactly once per epoch.
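
The sketch below illustrates this epoch-based schedule; the function name and data-structure choices are assumptions made for illustration.
\begin{verbatim}
import random

# Sketch: per-task mini-batches are built independently, tagged with their
# task (so gating stays task-specific), then shuffled into one global order.
def build_epoch_schedule(task_datasets, batch_size, seed=42):
    rng = random.Random(seed)
    schedule = []
    for task, data in task_datasets.items():
        order = list(range(len(data)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            schedule.append((task, batch))   # each mini-batch is single-task
    rng.shuffle(schedule)                     # randomized global batch order
    return schedule
\end{verbatim}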

\subsection{Training Setup}

We implement all methods using PyTorch and DeepSpeed ZeRO-2 with CPU offloading to reduce GPU memory consumption.
All experiments are conducted on NVIDIA GeForce RTX 4090 GPUs with BF16 mixed-precision training.
We adopt the AdamW optimizer with $\beta_1=0.9$ and $\beta_2=0.95$, and employ a cosine learning rate schedule with a warmup ratio of 10\%.
All models are trained using early stopping based on the validation loss.

\subsection{Implementation Details}
All experiments are implemented in Python 3.12.3 using PyTorch 2.7.0 with CUDA 12.8.
We use Hugging Face Transformers 4.51.0 for model loading, PEFT 0.17.0 for baseline implementations, and DeepSpeed 0.18.4 to accelerate training.
Additional dependencies include Datasets 4.4.1, Accelerate 1.9.0, and NumPy 2.2.6.

\subsection{MESSA Configuration}

For sparse structure learning in MESSA, we set the learning rate to $1\times10^{-4}$ and use an effective batch size of 8, implemented as 2 samples per device with 4 steps of gradient accumulation.
MESSA is applied to the attention projection layers (Q, K, V, and O) under a unified parameter budget of 2.5\% relative to the backbone model.
The candidate pool factor is set to 1.5.
The gate warmup phase occupies the first 5\% of training steps, followed by one-shot hard pruning at 15\% of the total training steps.
We encourage mutual exclusion between shared and task-specific updates via overlap regularization, allowing up to 15\% overlap during soft structure learning.
After pruning, unselected parameter groups are permanently frozen and soft gates are converted into hard binary masks.

\subsection{Baseline Configurations}

All baselines are configured to ensure fair comparison under similar parameter budgets.
For LoRA and SHiRA, we use rank $r=180$ in the shared setting and $r=36$ in the task-specific setting, as both methods adopt comparable low-rank parameterizations.
For AdaLoRA, which adaptively adjusts rank during training, we set the initial rank to $r=100$ (shared) and $r=20$ (task-specific).
For MTLoRA, we use $r_{\text{shared}}=r_{\text{task}}=36$ with the \texttt{matrixv2} fusion mode.
For MOELoRA, we configure five experts corresponding to the number of tasks, each with rank $r=180$ and task embedding dimension 64.