\title{Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models} \begin{abstract} Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing methods, achieving an average performance improvement of 3.65\%. The implementation code and data are publicly available to facilitate reproducibility.\footnote{https://github.com/Applied-Machine-Learning-Lab/HyCAM} \end{abstract} \input{0_misc} \section{Introduction} \label{sec:intro} Large Language Models (LLMs) have demonstrated remarkable capabilities, driven by their extensive general knowledge and powerful reasoning abilities~\cite{achiam2023gpt, team2023gemini}.
Beyond conversational use, these models are increasingly proving invaluable as core components in advanced information retrieval~\cite{li2023e4srec, li2023web}, critical decision-making systems~\cite{brynjolfsson2025generative, wang2023rethinking}, and spatiotemporal applications~\cite{cheng2025poi, zhang2024veccity}. This success has led to increasing demand for adapting such models to specialized domains and, more importantly, for handling multiple diverse tasks simultaneously. This capability is essential for effective deployment in real-world applications~\cite{bommasani2021opportunities, yu2025bigcity, ji2025seeing}. Supervised Fine-Tuning (SFT), a widely adopted adaptation approach, involves further tuning a pre-trained model on task-specific instruction data~\cite{wei2021finetuned}. However, achieving effective adaptation remains a significant challenge. Conventional full-parameter fine-tuning, a common SFT implementation that updates all parameters, must achieve effective adaptation while preserving foundational capabilities. Training on a narrow task-specific dataset can significantly alter the model's pre-trained weights, leading to catastrophic forgetting~\cite{lester2021power}. Furthermore, such an approach typically demands substantial computational resources. These limitations hinder its applicability in many practical scenarios, especially in multi-task settings~\cite{wang2023multi, fu2025training}, where models must balance generalization and specialization. To address these limitations, various Parameter-Efficient Fine-Tuning (PEFT) techniques have been proposed. These approaches adapt pre-trained LLMs to new tasks by updating only a small number of trainable parameters while leaving the backbone model unchanged, thereby reducing computational cost and overfitting risks~\cite{han2024parameter}.
Common PEFT strategies include adapter-based methods~\cite{houlsby2019parameter} that insert lightweight trainable modules, prompt-based methods such as Prefix Tuning~\cite{li2021prefix} that modify input representations, and reparameterization methods like Low-Rank Adaptation (LoRA)~\cite{hu2021lora} and its variants. LoRA, a widely used PEFT method, applies a low-rank decomposition to weight updates, making it both efficient and effective. However, these methods face limitations in complex multi-task scenarios due to their limited generalization and representational capacity across diverse tasks, as well as potential interference when adapting to multiple objectives simultaneously~\cite{yu2020gradient, liu2021conflict, navon2022multi}. Specifically, for low-rank reparameterization approaches like LoRA, the low-rank constraint may restrict model expressiveness when applied to highly complex tasks, resulting in suboptimal performance~\cite{pan2024lisa}. Strategies that incorporate the Mixture-of-Experts (MoE) mechanism, which combines multiple specialized PEFT modules for multi-task adaptation, aim to enhance model capacity for diverse tasks. However, these MoE-based approaches introduce additional challenges, including mitigating coupling effects and effectively managing the contributions of different experts~\cite{rajbhandari2022deepspeed}. Overall, adapting LLMs to diverse tasks presents two major challenges: (1) preserving rich pre-trained general knowledge while specializing for specific tasks, and (2) extending the multi-task capabilities of parameter-efficient methods. Our approach is motivated by a key observation regarding LLM architectures: different components of the Transformer exhibit distinct roles and activation behaviors. Existing literature suggests that Feed-Forward Network (FFN) layers, which constitute the bulk of model parameters, primarily function as key repositories for storing and recalling general knowledge~\cite{geva2021transformer}.
In contrast, self-attention mechanisms are primarily responsible for processing and integrating contextual information within the input sequence, capturing dependencies between tokens~\cite{jin2025massive}. This functional difference is also reflected in parameter activation patterns. While FFNs, comprising approximately 90\% of model parameters, exhibit high activation sparsity, self-attention mechanisms typically demonstrate denser activation patterns~\cite{cai2024survey, fedus2022switch, jaszczur2021sparse}. This denser engagement highlights the critical role of self-attention in integrating latent general knowledge with contextual information derived from the input. Given these differences, we argue that focusing on the modulation of self-attention during multi-task adaptation provides a more effective and specialized strategy. The key insight is that large-scale pre-training has already equipped LLMs with extensive general knowledge, so effective adaptation should focus on enabling LLMs to better integrate task-specific contextual information. Such an approach can refine how general knowledge is combined with the specific contextual demands of diverse tasks. Importantly, this modulation preserves pre-trained general knowledge, thereby mitigating issues such as catastrophic forgetting and task interference. To this end, we introduce Contextual Attention Modulation (CAM), a novel mechanism designed to dynamically modulate the representations within the self-attention modules of LLMs based on the input context. This context-aware mechanism selectively amplifies task-relevant attentional signals and suppresses irrelevant or interfering ones, thereby enhancing task-specific features while preserving the model's pre-trained general knowledge.
Directly modulating the organization of contextual information within attention modules promotes more effective knowledge retention and specialized adaptation, thereby supporting more robust and efficient multi-task learning. To extend these benefits to multi-task settings, we embed CAM into our Hybrid Contextual Attention Modulation (HyCAM) framework. HyCAM combines a shared, full-parameter CAM module, which is designed to capture and leverage common knowledge across all tasks, with multiple specialized, lightweight CAM modules. These specialized modules implement the CAM mechanism using PEFT techniques to efficiently capture distinct features, allowing effective multi-task adaptation with minimal additional trainable parameters. A soft-routing strategy, further augmented by a load-balancing constraint, dynamically manages the fusion of knowledge from these shared and specialized CAM components. This design empowers HyCAM to improve multi-task performance by enabling both efficient knowledge sharing and fine-grained specialization. The main contributions of this paper are summarized as follows: \begin{itemize}[leftmargin=*, topsep=0pt] \item We propose Contextual Attention Modulation (CAM), a novel mechanism that learns to dynamically modulate self-attention representations in LLMs based on input context. CAM is designed to enhance task-specific features while preserving pre-trained general knowledge, thereby facilitating more effective knowledge retention and specialized adaptation. \item We introduce the Hybrid Contextual Attention Modulation (HyCAM) framework, which extends multi-task adaptation capabilities by integrating our CAM mechanism in distinct forms. This integration empowers HyCAM to achieve superior multi-task performance by effectively balancing efficient knowledge sharing with fine-grained task specialization. \item We conduct extensive experiments across a range of tasks covering question answering, code generation, logical reasoning, and other domains.
Comparative experiments demonstrate that HyCAM significantly outperforms existing state-of-the-art approaches with faster convergence. \end{itemize} \section{Preliminaries} This section briefly reviews the fundamental concepts essential for understanding our proposed method. We discuss the relevant components of the Transformer architecture, the basics of task-adaptive fine-tuning, and common PEFT techniques. \subsection{Transformer Architecture} The Transformer architecture~\cite{vaswani2017attention} serves as the backbone of most LLMs owing to its ability to efficiently process sequences of data through attention mechanisms, making it especially powerful for understanding and generating human language. A Transformer model is typically composed of a stack of identical blocks. Each block primarily contains two core components: the self-attention mechanism and the Feed-Forward Network (FFN). The self-attention mechanism allows the model to weigh the importance of different tokens in an input sequence and capture contextual relationships by computing attention scores using Query ($Q$), Key ($K$), and Value ($V$) projections, often via scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. Following this, the FFN, typically composed of two linear transformations with a non-linear activation, further processes each token's representation independently, encoding much of the model's complex knowledge. \subsection{Task-Adaptive Fine-Tuning} \label{sec:finetune} While LLMs acquire extensive general knowledge and reasoning capabilities, they typically require further adaptation to specialize them for specific tasks and align their behavior with desired objectives, such as following instructions. A common approach for such task-adaptive fine-tuning is Supervised Fine-Tuning (SFT). In SFT, the model learns from examples that provide explicit input-output pairings.
These pairings might illustrate a question with its corresponding answer or an instruction followed by the desired model output. The primary goal is to adjust the model's parameters to minimize a task-specific loss function, such as cross-entropy loss for sequence generation or classification tasks. \subsection{Parameter-Efficient Fine-Tuning} Adapting LLMs to specific tasks often involves fine-tuning, but updating all parameters is computationally expensive. PEFT methods enable model adaptation by introducing a small set of new parameters or reparameterizing existing ones while keeping the backbone model weights frozen, significantly reducing computational costs. A mainstream PEFT category is reparameterization, which introduces a smaller set of trainable parameters that efficiently influence the model's behavior. For instance, a common strategy is to represent the change in a pre-trained weight matrix $W_0$ during adaptation as a low-rank update, based on the observation that task-specific changes often lie in a subspace of much lower dimensionality than the full parameter space. Thus, instead of learning a large, dense update matrix $\Delta W$, these methods learn a low-rank approximation of it, such as $\Delta W = BA$, where $B$ and $A$ are much smaller matrices~\cite{hu2021lora}. \begin{figure*}[ht] \centering \includegraphics[width=0.75\linewidth]{assets/model_v4.pdf} \caption{The architecture of the CAM and HyCAM framework. HyCAM applies a hybrid CAM mechanism to the output of the Attention module within each Transformer block, while the backbone LLM remains frozen. Specifically, HyCAM integrates a shared, full-parameter CAM module and multiple lightweight Specialized CAMs for common and task-specific knowledge.} \label{fig:model} \end{figure*} \section{Method} We first present an overview of our proposed HyCAM framework. Next, the core CAM mechanism is further detailed.
We then provide an in-depth description of the HyCAM framework, including its hybrid components and dynamic knowledge fusion strategies with a soft-routing method and load-balancing constraint, and conclude by specifying the training objective. \subsection{Framework Overview} To address the critical challenge of enabling LLMs to efficiently adapt to diverse tasks while balancing knowledge retention with task-specific specialization, we introduce the Hybrid Contextual Attention Modulation (HyCAM) framework. The core mechanism of HyCAM is Contextual Attention Modulation (CAM), which dynamically learns context-dependent modulation of self-attention representations, selectively amplifying task-relevant signals while suppressing irrelevant or potentially interfering ones to enhance task-specific features and preserve general knowledge. As illustrated in Figure~\ref{fig:model}, the HyCAM framework employs a novel hybrid architecture that integrates a shared, full-parameter CAM module, designed for capturing common knowledge across tasks, with multiple specialized CAM modules that utilize parameter-efficient techniques for efficient, fine-grained adaptation to distinct task features. The contributions of these diverse CAM modules are managed by a dynamic routing strategy to ensure balanced utilization of the specialized components and adaptive knowledge fusion. \subsection{Contextual Attention Modulation} \label{sec:cam} The CAM mechanism is the core of our HyCAM framework, designed to dynamically modulate self-attention representations at each Transformer block. It learns to dynamically amplify task-relevant attentional signals and suppress irrelevant ones based on the input context, thereby enhancing task-specific features while preserving the model's pre-trained general knowledge, which facilitates more effective and efficient task adaptation. 
\subsubsection{\textbf{Motivation}} Our motivation for developing CAM stems from an analysis of the distinct roles and activation patterns of different Transformer components, as described in Section~\ref{sec:intro}. While FFN modules account for a large portion of parameters and store a vast amount of an LLM's parameterized knowledge, self-attention modules are crucial for dynamically processing and integrating contextual information. The varying activation patterns of these components highlight the important role of the self-attention modules in integrating latent general knowledge with the specific context derived from an input. Since large-scale pre-training equips LLMs with extensive general knowledge, the key to effective adaptation lies in enabling them to better integrate this foundational knowledge with task-specific contextual information. Conventional fine-tuning approaches, however, can often overwrite valuable pre-trained representations when introducing new task-specific knowledge. This observation motivated us to develop CAM, a mechanism that refines how general knowledge is integrated with the specific contextual demands of diverse tasks by modulating self-attention representations. This approach aims to facilitate task-adaptive specialization while preserving valuable pre-trained knowledge. \subsubsection{\textbf{The CAM Mechanism}} \label{subsec:camdetail} The CAM mechanism is integrated into each Transformer block, operating on the output of the self-attention modules to dynamically modulate their representations based on the input context. This process allows for fine-grained modulation of the contextual information flow. Specifically, the CAM mechanism proceeds as follows: \paratitle{Input Normalization: } Let $h_{in} \in \mathbb{R}^{L \times d}$ be the input hidden state to a Transformer layer, where $L$ denotes the sequence length and $d$ represents the hidden dimension.
Consistent with standard Transformer operations, these input hidden states are first normalized using Layer Normalization~\cite{ba2016layer}, producing $h_{norm} \in \mathbb{R}^{L \times d}$: \begin{equation} h_{norm} = \text{LayerNorm}(h_{in}). \end{equation} The resulting $h_{norm}$ serves as the input for both the conventional self-attention computation and our CAM module. \paratitle{Modulation Weight Generation: } CAM then computes a context-dependent modulation weight tensor, denoted as $\mathbf{A}_{\text{CAM}} \in \mathbb{R}^{L \times d}$. These weights are derived from the normalized hidden state $h_{norm}$ through a linear projection parameterized by a trainable weight matrix $W_{proj} \in \mathbb{R}^{d \times d}$, followed by a SiLU activation function~\cite{elfwing2018sigmoid}: \begin{equation} \mathbf{A}_{\text{CAM}} = \text{SiLU}(h_{norm} W_{proj}). \end{equation} The matrix $W_{proj}$ is specific to the CAM module and is crucial for learning how to modulate the attention representations based on the input context. To ensure stability during the initial phases of fine-tuning and to allow the model to gradually learn the modulation, $W_{proj}$ is initialized as a zero matrix. Since $\text{SiLU}(0) = 0$, this initialization ensures that, at the beginning of fine-tuning, the modulation weights are zero and CAM does not alter the pre-trained model's behavior. That is, the model initially maintains its original approach to processing contextual information, which is then gradually modulated as training progresses for stable adaptation. \paratitle{Application of Modulation: } Concurrently, the standard attention output $h_{att} \in \mathbb{R}^{L \times d}$ is computed using the normalized input $h_{norm}$: \begin{equation} h_{att} = \text{Self-Attention}(h_{norm}). \label{eq:oriattn} \end{equation} The CAM mechanism then refines this $h_{att}$ by applying the learned modulation weights $\mathbf{A}_{\text{CAM}}$. This is performed via an element-wise Hadamard product ($\odot$).
The modulated signal is integrated with the original $h_{att}$ through a residual connection, forming the final output $h_{out} \in \mathbb{R}^{L \times d}$ of the attention mechanism incorporating CAM: \begin{equation} h_{out} = h_{att} + h_{att} \odot \mathbf{A}_{\text{CAM}}. \end{equation} \subsubsection{\textbf{Advantages}} By dynamically generating and applying these modulation weights, CAM refines the contextual representation from the self-attention modules to adapt it to specific tasks while preserving the pre-trained general knowledge, thereby mitigating catastrophic forgetting. Thus, CAM facilitates an effective balance between achieving task-specific adaptation and retaining extensive general knowledge. Moreover, by modulating attentional outputs instead of fine-tuning a large number of backbone parameters, CAM achieves computational efficiency. \vspace{-5px} \subsection{The HyCAM Framework} While the CAM mechanism provides a powerful tool for modulating attention representations, adapting LLMs to handle multiple diverse tasks simultaneously presents significant challenges. Conventional full fine-tuning struggles with catastrophic forgetting and resource demands, while existing PEFT methods still face limitations for multi-tasking. Specifically, their limited representational capacity makes them suboptimal for highly complex tasks, and naive applications of expert-based strategies lead to imbalanced expert utilization. To address these challenges and effectively leverage the CAM mechanism for complex multi-task learning scenarios, we introduce the HyCAM framework. The framework is designed to extend multi-task adaptation capabilities by integrating CAM in hybrid forms, enabling both efficient knowledge sharing and fine-grained task specialization.
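As a minimal illustration of the CAM computation, the following NumPy sketch traces the normalization, modulation weight generation, and Hadamard-plus-residual steps described above. This is a simplified sketch rather than the released implementation: the frozen self-attention is replaced by a placeholder function, and the LayerNorm omits learnable affine parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified LayerNorm without learnable affine parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(x):
    # SiLU activation: x * sigmoid(x); note SiLU(0) = 0.
    return x / (1.0 + np.exp(-x))

def cam_block(h_in, attn_fn, W_proj):
    """One attention sub-layer augmented with CAM.

    h_in:    (L, d) input hidden states
    attn_fn: the frozen self-attention of the block (placeholder here)
    W_proj:  (d, d) trainable CAM projection, zero-initialized
    """
    h_norm = layer_norm(h_in)          # shared input for attention and CAM
    a_cam = silu(h_norm @ W_proj)      # context-dependent modulation weights
    h_att = attn_fn(h_norm)            # standard self-attention output
    return h_att + h_att * a_cam       # Hadamard modulation + residual

# Toy usage with a placeholder attention function and zero-initialized W_proj.
L, d = 4, 8
rng = np.random.default_rng(0)
h_in = rng.standard_normal((L, d))
identity_attn = lambda x: x            # stand-in for the frozen attention
W_proj = np.zeros((d, d))
out = cam_block(h_in, identity_attn, W_proj)
# Zero initialization leaves the attention output unchanged at the start.
assert np.allclose(out, layer_norm(h_in))
```

The final assertion confirms the stability property stated above: with $W_{proj}$ initialized to zero, the modulation is inactive and the block reduces to the pre-trained attention computation.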
This is achieved by strategically combining a shared, full-parameter CAM module for efficient knowledge sharing with multiple specialized, parameter-efficient CAM modules for fine-grained specialization. The contributions of these components are coordinated by a dynamic routing mechanism with a load-balancing constraint to ensure adaptive knowledge fusion. \subsubsection{\textbf{Hybrid CAM Components}} The hybrid architecture of the HyCAM framework is designed to leverage both general context understanding and specialized, task-specific adaptation capabilities. This architecture comprises a shared, full-parameter CAM module and multiple lightweight, specialized CAM modules: \paratitle{Shared CAM Module: } The Shared CAM module serves as a global modulator that captures and refines common contextual patterns and general knowledge across all tasks. This module is a full-parameter CAM, as detailed in Section~\ref{sec:cam}. Its trainable projection matrix, denoted as $W_{Shared} \in \mathbb{R}^{d \times d}$, is shared and updated across all tasks to produce a modulation weight tensor: \begin{equation} \mathbf{A}_{Shared} = \text{SiLU}(h_{norm}W_{Shared}). \end{equation} \paratitle{Specialized CAM Modules: } In addition to the shared module, HyCAM incorporates multiple ($N_s$) lightweight Specialized CAM modules. Specialized CAM modules are designed to learn and apply attention modulations for the distinct features of specific tasks. Different tasks often require different ways of handling contextual information in the self-attention layer. For example, code generation may need to focus on long-range dependencies, while question answering may prioritize specific entities and their relationships in a local context. This design enables the model to develop fine-grained adaptations for diverse tasks, thereby mitigating the interference that arises when a single component attempts to learn potentially conflicting objectives from multiple tasks.
The implementation of the Specialized CAM modules leverages PEFT techniques to reduce the number of trainable parameters per specialized module, making the framework scalable. This also helps mitigate overfitting, especially when task-specific data are limited. Specifically, each Specialized CAM module, indexed by $k \in \{1, ..., N_s\}$, generates its unique modulation weight tensor $\mathbf{A}_{\text{Spec}_k} \in \mathbb{R}^{L \times d}$ as follows: \begin{equation} \mathbf{A}_{\text{Spec}_k} = \text{SiLU}(h_{norm} W_{\text{Spec}_k}), \end{equation} where $W_{\text{Spec}_k}$ is the trainable projection matrix specific to the $k$-th specialized module. To achieve parameter efficiency while enhancing representational capacity, we adopt the SLoRA~\cite{guo2025nlora} technique for the structure of $W_{\text{Spec}_k}$. Instead of a direct low-rank decomposition like LoRA, typically $W = BA$, SLoRA introduces an intermediate trainable matrix $N$ between $B$ and $A$. Thus, $W_{\text{Spec}_k}$ is parameterized as: \begin{equation} W_{\text{Spec}_k} = B_k N_k A_k. \end{equation} Since $h_{norm}$ multiplies $W_{\text{Spec}_k}$ from the left, $B_k \in \mathbb{R}^{d \times r}$ first projects the $d$-dimensional hidden state into a lower-dimensional space of rank $r$. $N_k \in \mathbb{R}^{r \times r}$ is a trainable intermediate matrix within the low-rank space. $A_k \in \mathbb{R}^{r \times d}$ then projects the $r$-dimensional representation back to the original $d$-dimensional space. The rank $r$ is significantly smaller than $d$, ensuring a substantial reduction in trainable parameters compared to a full $d \times d$ matrix. For initialization, and similar to the zero-initialization of $W_{\text{Shared}}$ in the Shared CAM module, we adopt a strategy to ensure training stability. Specifically, the matrices $A_k$ and $N_k$ are initialized using Kaiming Uniform~\cite{he2015delving}, while $B_k$ is initialized with zeros, so that $W_{\text{Spec}_k}$ is zero at the start of training.
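The parameter structure of one Specialized CAM projection can be sketched as follows. This is an illustrative sketch, not the released implementation: the sizes $d{=}64$ and $r{=}8$ and the Kaiming-uniform bounds are our assumptions for the example.

```python
import numpy as np

d, r = 64, 8  # illustrative hidden size and rank (assumed values)
rng = np.random.default_rng(0)

# Kaiming-uniform-style initialization for A_k and N_k (assumed bounds).
bound = np.sqrt(6.0 / r)
A_k = rng.uniform(-bound, bound, size=(r, d))  # up-projection back to d dims
N_k = rng.uniform(-bound, bound, size=(r, r))  # intermediate r x r matrix
B_k = np.zeros((d, r))                         # zero-init down-projection

# Effective (d, d) projection: W_spec = B_k N_k A_k, applied as h_norm @ W_spec.
W_spec = B_k @ N_k @ A_k
assert W_spec.shape == (d, d)
assert np.allclose(W_spec, 0.0)                # no modulation at initialization

# Parameter savings versus a full d x d projection matrix.
full_params = d * d
slora_params = B_k.size + N_k.size + A_k.size  # 2*d*r + r*r
assert slora_params < full_params
```

For these assumed sizes, the factorized form uses $2dr + r^2 = 1{,}088$ parameters instead of $d^2 = 4{,}096$, and the zero-initialized $B_k$ guarantees $W_{\text{Spec}_k}=0$ at the start of training, mirroring the stability argument for the shared module.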
This structure allows each Specialized CAM to develop task-specific modulations with very few additional parameters, thus enhancing the adaptability of the model without sacrificing efficiency. \subsubsection{\textbf{Dynamic Routing}} \label{sec:routing} To effectively leverage the diverse contributions from the Shared CAM and the multiple Specialized CAM modules, HyCAM incorporates a dynamic soft-routing mechanism coupled with a load-balancing constraint. This mechanism adaptively determines the influence of each module based on the input context and promotes balanced load to ensure efficient utilization of all Specialized CAMs. \paratitle{Routing for Specialized CAMs: } The dynamic routing mechanism weights the contributions of the $N_s$ Specialized CAM modules for each input token. This enables HyCAM to adapt its modulation strategy in a fine-grained, context-dependent manner. The routing process is detailed as follows: For each token representation $h_{norm}$, derived from $h_{in}$ as described in Section~\ref{subsec:camdetail}, a lightweight router network first generates $\mathbf{logits} \in \mathbb{R}^{N_s}$ via a linear layer applied to $h_{norm}$: \begin{equation} \mathbf{logits} = h_{norm} W_{router}, \end{equation} where $W_{router} \in \mathbb{R}^{d \times N_s}$ is the trainable weight matrix of the router. These $\mathbf{logits}= [\pi_1, \pi_2, ..., \pi_{N_s}]$ are then transformed into a probability distribution over the specialized modules using the Gumbel-Softmax estimator~\cite{jang2016categorical} to obtain differentiable, soft routing probabilities. The Gumbel-Softmax allows for differentiable sampling from a categorical distribution, which facilitates the training process while encouraging exploration: \begin{equation} p_k = \frac{\exp((\pi_k + g_k)/\tau)}{\sum_{j=1}^{N_s} \exp((\pi_j + g_j)/\tau)}, \label{eq:gumbel_softmax} \end{equation} where $p_k$ is the resulting soft routing weight for the $k$-th Specialized CAM module.
$g_k \sim \text{Gumbel}(0,1)$ are \iid noise samples drawn from the Gumbel distribution, adding stochasticity for exploration. $\tau$ is a temperature hyperparameter that controls the sharpness of the probability distribution. Lower temperatures make the selection more discrete, while higher temperatures make it softer. \paratitle{Load Balancing Loss: } To prevent the router from over-selecting a few modules, HyCAM adds a load-balancing loss $\mathcal{L}_{balance}$ that encourages more balanced routing across specialized components. For a batch of $B$ tokens, it is computed as: \begin{equation} \mathcal{L}_{balance} = \sum_{k=1}^{N_s} \left( \frac{1}{B} \sum_{b=1}^{B} p_{b,k} \right) \cdot \left( \frac{1}{B} \sum_{b=1}^{B} \text{softmax}(\mathbf{logits}_{b})_k \right), \label{eq:load_balance_loss} \end{equation} where $p_{b,k}$ is the Gumbel-Softmax output and $\text{softmax}(\mathbf{logits}_{b})_k$ is the standard softmax output of the router logits. \paratitle{Fusion of Modulations: } Once the routing weights $p_k$ are determined for each token, as described in Equation~\ref{eq:gumbel_softmax}, the final context-dependent modulation tensor, $\mathbf{A}_{Fusion} \in \mathbb{R}^{L \times d}$, is computed by combining the output of the Shared CAM module, $\mathbf{A}_{Shared}$, with the dynamically weighted sum of the modulations from all Specialized CAM modules, $\{\mathbf{A}_{\text{Spec}_k}\}_{k=1}^{N_s}$: \begin{equation} \mathbf{A}_{Fusion} = \mathbf{A}_{Shared} + \sum_{k=1}^{N_s} p_k \cdot \mathbf{A}_{\text{Spec}_k}, \label{eq:fusion_modulation} \end{equation} Here, $p_k$ denotes the token-specific routing weight of the $k$-th specialized module, ensuring that the context-based modulation of $\mathbf{A}_{Fusion}$ integrates both general and adaptively selected specialized knowledge.
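The routing, load-balancing, and fusion steps can be sketched together as follows. This is a simplified NumPy illustration under stated assumptions: Gumbel noise is added directly to the router logits, and, for readability, one routing weight vector is applied to a whole sequence rather than per token.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Soft routing weights from router logits plus Gumbel(0, 1) noise.
    if rng is None:
        rng = np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def load_balance_loss(p, logits):
    # Per-module mean routing weight times per-module mean softmax probability.
    return float(np.sum(p.mean(axis=0) * softmax(logits).mean(axis=0)))

# Toy example: a batch of B token representations routed over N_s modules.
B, N_s, L, d = 16, 4, 16, 8
rng = np.random.default_rng(0)
logits = rng.standard_normal((B, N_s))       # router outputs per token
p = gumbel_softmax(logits, tau=1.0, rng=rng)
assert np.allclose(p.sum(axis=-1), 1.0)      # valid routing distributions

loss = load_balance_loss(p, logits)

# Fusion: shared modulation plus routing-weighted specialized modulations.
A_shared = rng.standard_normal((L, d))
A_spec = rng.standard_normal((N_s, L, d))
p_tok = p[0]                                 # routing weights for one token
A_fusion = A_shared + np.einsum("k,kld->ld", p_tok, A_spec)
assert A_fusion.shape == (L, d)
```

Lower values of `tau` sharpen the routing distribution towards a near-discrete module selection, while higher values spread weight across modules, matching the temperature behavior described above.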
Finally, $\mathbf{A}_{Fusion}$ is applied to the original self-attention output $h_{att}$, from Equation~\ref{eq:oriattn} in Section~\ref{subsec:camdetail}, to produce the HyCAM-enhanced output $h_{out}$ using the element-wise Hadamard product and residual connection, as defined in the core CAM mechanism: \begin{equation} h_{out} = h_{att} + h_{att} \odot \mathbf{A}_{Fusion}. \end{equation} This entire mechanism, from dynamic routing to the application of the fused modulation, allows HyCAM to dynamically modulate the self-attention process by integrating shared knowledge with specialized insights, thereby enabling the model to effectively balance generalization across diverse tasks with task-specific adaptation. \begin{table*}[t] \small \centering \caption{Dataset statistics.} \label{tab:dataset} \resizebox{0.85\linewidth}{!}{ \renewcommand{\arraystretch}{0.94} \begin{threeparttable}[b] \begin{tabular}{lccccc} \toprule Dataset & Samples & Total Tokens\tnote{1} & Avg. Tokens/Sample\tnote{1} & Domain & Source \\ \midrule Auto CoT & 5,816 & 943,474 & 162.22 & Arithmetic and other logical reasoning tasks & \cite{zhang2023automatic}\\ iCliniq & 7,321 & 1,826,306 & 249.46 & Conversations between patients and doctors & \cite{li2023chatdoctor}\\ Dolly 2.0 & 15,015 & 3,061,007 & 203.86 & Closed QA and summarization from Wikipedia & \cite{DatabricksBlog2023DollyV2}\\ CodeAlpaca & 20,222 & 2,195,523 & 109.66 & Code generation and optimization & \cite{codealpaca}\\ WebGPT & 18,994 & 13,988,895 & 736.49 & Information retrieval QA & \cite{nakano2021webgpt}\\ \bottomrule \end{tabular} \begin{tablenotes} \item[1] Calculated with the Llama-3 tokenizer. \end{tablenotes} \end{threeparttable} } \end{table*} \begin{table*}[t] \centering \caption{Experimental results across different backbone LLMs. \textbf{*} indicates statistically significant improvement (\ie two-sided t-test with $p<0.05$) over the best PEFT baseline.
Lower PPL$\downarrow$ is better, while higher BLEU$\uparrow$ and ROUGE$\uparrow$ reflect higher quality. The best results are bolded, while the second-best results are underlined. } \label{tab:exp1} \resizebox{0.95\linewidth}{!}{ \renewcommand{\arraystretch}{1} \begin{tabular}{l|ccc|ccc|ccc|ccc|ccc} \toprule Backbone LLM & \multicolumn{3}{c|}{Llama 2 7B} & \multicolumn{3}{c|}{Llama 3 8B} & \multicolumn{3}{c|}{Llama 3.1 8B} & \multicolumn{3}{c|}{Mistral 7B} & \multicolumn{3}{c}{Qwen 2.5 7B} \\ \midrule Metric & PPL$\downarrow$ & BLEU$\uparrow$ & ROUGE$\uparrow$ & PPL$\downarrow$ & BLEU$\uparrow$ & ROUGE$\uparrow$ & PPL$\downarrow$ & BLEU$\uparrow$ & ROUGE$\uparrow$ & PPL$\downarrow$ & BLEU$\uparrow$ & ROUGE$\uparrow$ & PPL$\downarrow$ & BLEU$\uparrow$ & ROUGE$\uparrow$\\ \midrule Full Finetune & 3.193 & \underline{0.171} & 0.231 & 3.978 & 0.151 & 0.203 & 3.873 & 0.153 & 0.205 & 4.403 & 0.157 & 0.192 & 3.024 & \underline{0.169} & 0.225 \\ LoRA & 3.222 & 0.157 & 0.225 & 3.556 & 0.148 & 0.240 & 3.537 & 0.156 & 0.237 & \underline{3.418} & \underline{0.163} & \underline{0.244} & 2.840 & 0.137 & \underline{0.239} \\ \midrule Multi LoRA & 3.287 & 0.121 & 0.217 & 3.547 & 0.157 & 0.236 & 3.653 & 0.134 & 0.235 & 3.461 & 0.141 & 0.225 & 3.069 & 0.136 & 0.222 \\ RieMoE-LoRA & \underline{3.171} & 0.154 & \underline{0.232} & \underline{3.497} & \underline{0.159} & \underline{0.242} & \underline{3.487} & \underline{0.161} & \underline{0.238} & 3.597 & 0.143 & 0.240 & \underline{2.830} & 0.157 & 0.227 \\ HyCAM & \textbf{3.081*} & \textbf{0.173*} & \textbf{0.244*} & \textbf{3.484*} & \textbf{0.162*} & \textbf{0.245*} & \textbf{3.453*} & \textbf{0.172*} & \textbf{0.251*} & \textbf{3.299*} & \textbf{0.171*} & \textbf{0.249*} & \textbf{2.757*} & \textbf{0.172*} & \textbf{0.248*} \\ \bottomrule \end{tabular} } \end{table*} \subsection{Training Details} The HyCAM framework, including the Shared CAM module, the Specialized CAM modules, and the dynamic router, is trained end-to-end.
We use a composite objective function that combines a primary task-specific loss with the auxiliary load-balancing loss, described in Section~\ref{sec:routing}. This approach ensures that the model not only learns to perform the target tasks effectively but also maintains balanced utilization of its specialized components, leading to efficient adaptation across diverse tasks and enhanced overall multi-task performance. \paratitle{Task-specific Loss: } We employ a standard autoregressive training strategy common for LLMs, as introduced in Section~\ref{sec:finetune}, where the model is trained to predict the next token in a sequence given the input context. Given an input sequence $\mathbf{X} = (x_1, x_2, \dots, x_m)$ and its corresponding target sequence $\mathbf{Y} = (y_1, y_2, \dots, y_n)$, the model is trained to predict each token $y_t$ conditioned on the input $\mathbf{X}$ and the previous target tokens $\mathbf{Y}_{