\title{Coarse-to-Fine Spectral Cascading for Parameter-Efficient LLM Adaptation}
|
|
|
|
\begin{document}
|
|
|
|
\maketitle
|
|
\begin{abstract}
|
|
Parameter-efficient fine-tuning is widely used for adapting large language models, and recent methods have explored frequency-domain parameterizations as a promising alternative to low-rank update assumptions.
|
|
However, most existing methods rely on a single structural assumption and treat different frequency components independently, limiting their ability to jointly model global adaptations, localized refinements, and the dependencies between them.
|
|
To address these limitations, we propose CASCADE, a coarse-to-fine spectral cascading framework for parameter-efficient LLM adaptation. Specifically, CASCADE models weight updates using heterogeneous experts across frequency and spatial domains, and explicitly coordinates global and local updates through cascaded spectral modulation and adaptive routing.
|
|
This design enables coherent integration of global structural adjustments with localized refinements, resulting in more effective and robust adaptation.
|
|
Extensive experiments across multiple benchmarks demonstrate that CASCADE consistently outperforms strong PEFT baselines.
|
|
\end{abstract}
|
|
|
|
\section{Introduction}
|
|
|
|
|
|
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language understanding, reasoning, and generation tasks, and have become a fundamental component in various real-world applications.
|
|
Despite their strong generalization ability, adapting pretrained LLMs to specific downstream tasks typically still requires task-specific fine-tuning.
|
|
However, full-parameter fine-tuning incurs substantial computational and storage costs, which limits its practicality, particularly when a single model must be adapted to multiple domains.
|
|
|
|
To address this challenge, parameter-efficient fine-tuning (PEFT) methods have been extensively studied.
|
|
Instead of updating all model parameters, PEFT methods restrict adaptation to a small set of trainable parameters while keeping the pretrained backbone frozen.
|
|
Representative approaches such as Low-Rank Adaptation (LoRA)~\cite{hu2021lora} assume that weight updates lie in a low-dimensional subspace, achieving strong parameter efficiency.
|
|
More recently, frequency-domain PEFT methods have been proposed, which parameterize weight updates in transformed domains such as Fourier or wavelet bases, exploiting the spectral structure of adaptation patterns~\cite{gao2024parameter,hu2025waveletft}.
|
|
These methods have demonstrated promising efficiency-performance trade-offs across a variety of tasks.
|
|
|
|
\begin{figure}[t]
|
|
\centering
|
|
\includegraphics[width=0.9\linewidth]{assets/influence_comparisonv3.pdf}
|
|
\caption{Spectral characteristics of weight updates under full fine-tuning. High-frequency components dominate the spectral energy of weight updates, whereas low-frequency components, despite their low energy, affect a substantially larger portion of the weight matrix. This pattern is consistent across layers and modules, highlighting the distinct global and local roles of different frequency components.}
|
|
\label{fig:spectral}
|
|
\end{figure}
|
|
|
|
Despite this progress, existing PEFT methods still face two key limitations:
|
|
|
|
\paragraph{Challenge 1: Single-structure limitation.}
|
|
Most PEFT approaches rely on a single structural constraint, such as restricting weight updates to a low-rank subspace or parameterizing them with a fixed frequency-domain basis.
|
|
However, a single structure is insufficient to capture the heterogeneous nature of weight updates in LLM adaptation, which typically involves both global adjustments to semantic or reasoning behavior and localized, fine-grained corrections.
|
|
\paragraph{Challenge 2: Cross-frequency dependency.}
|
|
Recent frequency-aware PEFT methods incorporate frequency or scale information into weight updates, but typically treat different frequency components as independent.
|
|
In practice, effective adaptation requires alignment between localized high-frequency refinements and global low-frequency updates. Ignoring such dependencies can lead to suboptimal or inefficient adaptation.
|
|
|
|
|
|
Fig.~\ref{fig:spectral} reveals a notable mismatch between spectral energy and spatial influence in weight updates.
|
|
High-frequency components dominate the spectral energy, yet their impact is often confined to a limited subset of parameters.
|
|
In contrast, low-frequency components contribute relatively small spectral energy but influence a substantially larger portion of the weight matrix.
|
|
The large spatial coverage with low spectral energy corresponds to smooth and coherent changes distributed across many parameters, characteristic of global structural adaptation.
|
|
Conversely, high spectral energy concentrated on a limited subset of parameters corresponds to sparse and localized modifications.
|
|
This contrast reveals that low-frequency components establish a global adaptation structure, while high-frequency components refine specific regions on top of this structure, forming a coarse-to-fine adaptation pattern.
|
|
These observations suggest that effective adaptation requires modeling heterogeneous frequency components with distinct roles and capturing the dependency between global and local updates.
|
|
|
|
Motivated by this insight, we propose \textbf{CASCADE} (Coarse-to-Fine Spectral Cascading), a parameter-efficient fine-tuning framework that explicitly models heterogeneous frequency components of weight updates and their dependencies.
|
|
CASCADE adopts a heterogeneous mixture-of-experts architecture, in which complementary experts are designed to capture different structural roles of weight updates.
|
|
Specifically, we employ \textit{(i) a low-frequency expert} based on the Discrete Cosine Transform (DCT) to capture global and smooth structural adjustments across the weight matrix; \textit{(ii) a high-frequency expert} operating on wavelet detail subbands to model localized refinements corresponding to fine-grained corrections; and \textit{(iii) a spatial residual expert} in the original parameter space to handle update patterns that are difficult to represent in the frequency domain.
|
|
|
|
Crucially, CASCADE goes beyond treating these components as independent. We introduce a cascaded spectral modulation mechanism that establishes an explicit coarse-to-fine dependency, in which low-frequency updates provide a global adaptation structure that conditions the generation of high-frequency refinements.
|
|
This design enforces alignment between global and local updates, ensuring that localized corrections remain consistent with the overall adaptation direction.
|
|
In addition, we incorporate a spectral complexity-aware routing mechanism that dynamically weights different experts based on input characteristics, enabling flexible and context-sensitive adaptation.
|
|
|
|
Together, these designs enable CASCADE to overcome the limitations of single-structure modeling and independent frequency components in existing PEFT methods, providing a unified framework that coherently captures both global and local updates. Our contributions are summarized as follows:
|
|
\begin{itemize}[leftmargin=*, topsep=0pt]
|
|
\item To our knowledge, CASCADE is the first PEFT framework that models LLM weight updates using heterogeneous experts across frequency and spatial domains, enabling a unified representation of global adjustments and localized refinements.
|
|
|
|
\item We introduce a cascaded spectral modulation mechanism that establishes coarse-to-fine dependencies between low- and high-frequency updates, together with a spectral complexity-aware routing strategy for adaptive expert combination.
|
|
|
|
\item Extensive experiments on fifteen public benchmarks, using three backbone models and covering commonsense and arithmetic tasks, demonstrate that CASCADE consistently outperforms existing mainstream PEFT methods, validating its effectiveness across diverse adaptation scenarios.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
\begin{figure*}[ht]
|
|
\centering
|
|
\includegraphics[width=0.8\linewidth]{assets/model2.pdf}
|
|
\caption{Overview of CASCADE.
|
|
CASCADE adapts frozen backbone modules via heterogeneous frequency-domain and spatial-domain experts, coordinated by cascaded modulation and dynamically combined through spectral complexity-aware routing.
|
|
}
|
|
\label{fig:framework}
|
|
\end{figure*}
|
|
|
|
\section{Preliminaries}
|
|
\label{sec:pre}
|
|
|
|
In this section, we briefly introduce the key formulations and perspectives that will be used throughout the paper, in order to facilitate the presentation of our method in Section~3.
|
|
|
|
\subsection{Parameter-Efficient Fine-Tuning}
|
|
Parameter-efficient fine-tuning (PEFT) aims to adapt a pretrained model to downstream tasks by learning a small number of trainable parameters, while keeping the original pretrained weights frozen.
|
|
|
|
Let $\mathbf{W}_0 \in \mathbb{R}^{m \times n}$ denote a pretrained weight matrix of a linear transformation, and $\mathbf{x} \in \mathbb{R}^n$ be the corresponding input.
|
|
Instead of directly updating $\mathbf{W}_0$, PEFT methods introduce an additive weight update $\Delta \mathbf{W}$, yielding the adapted transformation:
|
|
\begin{equation}
|
|
\mathbf{y} = (\mathbf{W}_0 + \Delta \mathbf{W}) \mathbf{x}.
|
|
\label{eq:peft}
|
|
\end{equation}
|
|
|
|
The core principle of PEFT is to impose structural constraints on $\Delta \mathbf{W}$ to significantly reduce the adaptation cost.
|
|
Common constraints include low-rank factorization and sparsity assumptions, while more recent approaches have also explored structured parameterizations in transformed domains.
|
|
Under this formulation, different PEFT methods can be viewed as imposing distinct structural assumptions on $\Delta \mathbf{W}$, which, in turn, determine the types of update patterns they are capable of representing.
|
|
In practice, most existing PEFT methods adopt a single structural assumption throughout adaptation.
|
|
This observation motivates exploring complementary structural assumptions that can capture heterogeneous update patterns from different perspectives.
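As a concrete illustration of Eq.~\eqref{eq:peft}, the following minimal PyTorch sketch instantiates the additive update with a low-rank structural constraint on $\Delta \mathbf{W}$; all sizes and initializations are illustrative assumptions rather than settings used in our experiments.
\begin{verbatim}
import torch

m, n, r = 64, 32, 4               # illustrative layer sizes and rank
W0 = torch.randn(m, n)            # frozen pretrained weight W_0
A = torch.randn(r, n) * 0.01      # trainable factor
B = torch.zeros(m, r)             # trainable factor, zero-init so dW = 0 at start

x = torch.randn(n)
dW = B @ A                        # structured update Delta W (low-rank constraint)
y = (W0 + dW) @ x                 # Eq. (1): adapted transformation
\end{verbatim}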
|
|
|
|
\subsection{Frequency-Domain View of Weight Updates}
|
|
|
|
Under the formulation in Eq.~\eqref{eq:peft}, the weight update $\Delta \mathbf{W} \in \mathbb{R}^{m \times n}$ can be interpreted as a two-dimensional signal defined over the parameter indices.
|
|
From this perspective, it is natural to analyze $\Delta \mathbf{W}$ in the frequency domain by applying an appropriate linear transform, which decomposes the update into components associated with different spatial frequencies.
|
|
|
|
In general, low-frequency components correspond to smooth, slowly varying patterns spanning large regions of the weight matrix, while high-frequency components capture rapid variations localized to specific parameter regions.
|
|
These components reflect distinct structural characteristics of weight updates, ranging from global, coherent adjustments to localized, fine-grained modifications.
|
|
Frequency-domain analysis thus provides a complementary view to spatial-domain parameterizations, emphasizing the scale and distribution of variations rather than explicit locations.
|
|
|
|
Different frequency-domain transforms, such as Fourier or wavelet transforms, offer alternative bases for representing $\Delta \mathbf{W}$, each inducing different biases with respect to globality, locality, and multi-scale structure.
|
|
Representing weight updates in frequency-domain bases allows different structural constraints to be applied to components at different frequencies, enabling finer control over the resulting update patterns.
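To make this view concrete, the following NumPy sketch measures how much of an update's spectral energy falls in a coarse low-frequency region, in the spirit of the analysis in Fig.~\ref{fig:spectral}; the stand-in matrix, transform choice (2D DCT), and region size are illustrative assumptions.
\begin{verbatim}
import numpy as np
from scipy.fft import dctn

dW = np.random.randn(64, 64)        # stand-in for a weight update
S = dctn(dW, norm="ortho")          # 2-D DCT spectrum of the update
ii, jj = np.indices(S.shape)
low = (ii + jj) < 16                # coarse low-frequency region near (0, 0)
ratio = (S[low] ** 2).sum() / (S ** 2).sum()  # fraction of spectral energy
\end{verbatim}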
|
|
|
|
Importantly, this frequency-domain perspective does not assume a specific decomposition strategy but provides a unified view for characterizing heterogeneous structures in weight updates.
|
|
However, existing PEFT methods typically adopt a single structural assumption for adaptation, limiting their ability to jointly capture global and localized updates.
|
|
|
|
\section{Method}
|
|
|
|
\subsection{Overview}
|
|
|
|
Existing PEFT methods typically rely on a single structural assumption and treat different components of weight updates as independent, which limits their ability to model heterogeneous and interdependent adaptation behaviors in LLMs.
|
|
In practice, effective adaptation often involves both global structural adjustments and localized refinements, which are difficult to capture under independent modeling assumptions.
|
|
From a spectral perspective, weight updates can be decomposed into components with distinct functional roles.
|
|
Low-frequency components generally capture smooth, global adjustments that shape the overall adaptation structure, whereas high-frequency components correspond to localized and fine-grained modifications.
|
|
In addition, these components are inherently coupled, forming a coarse-to-fine adaptation pattern in which local refinements are guided by a global structure.
|
|
|
|
|
|
Motivated by this observation, we propose \textbf{CASCADE} (Coarse-to-Fine Spectral Cascading), a PEFT framework that explicitly models heterogeneous update components together with their coarse-to-fine dependencies.
|
|
As illustrated in Fig.~\ref{fig:framework}, CASCADE adopts a heterogeneous mixture-of-experts design with a frozen backbone.
|
|
It introduces three complementary experts:
|
|
(i) a low-frequency expert operating in the Discrete Cosine Transform (DCT) domain to capture global and smooth updates,
|
|
(ii) a high-frequency expert that models wavelet detail components to represent localized refinements, and
|
|
(iii) a spatial residual expert in the original parameter space to handle update patterns that are not well represented in the frequency domain.
|
|
|
|
CASCADE further incorporates a cascaded spectral modulation mechanism, in which low-frequency updates condition high-frequency refinements to enforce consistency between global and local adaptations.
|
|
In addition, a spectral complexity-aware routing module dynamically combines the outputs of different experts based on input characteristics.
|
|
|
|
By jointly modeling heterogeneous updates and their dependencies, CASCADE provides a unified framework for efficiently capturing both global structural adjustments and local refinements.
|
|
|
|
\subsection{Problem Formulation}
|
|
|
|
Under the standard PEFT setting introduced in Section~\ref{sec:pre}, CASCADE adapts a frozen pretrained weight matrix $\mathbf{W}_0$ by learning a structured weight update.
|
|
Specifically, CASCADE represents the update as a combination of $E$ complementary experts, each producing a structured update $\Delta \mathbf{W}_e$ that captures a distinct type of adaptation pattern.
|
|
Given an input $\mathbf{x}$, the adapted output is obtained by aggregating expert-specific updates using input-dependent routing weights:
|
|
\begin{equation}
|
|
\mathbf{y} = \mathbf{W}_0 \mathbf{x} + \sum_{e=1}^{E} w_e(\mathbf{x}) \cdot \Delta \mathbf{W}_e \mathbf{x},
|
|
\label{eq:cascade_formulation}
|
|
\end{equation}
|
|
where $w_e(\mathbf{x})$ denotes the routing weight assigned to the $e$-th expert.
|
|
The specific parameterizations of $\Delta \mathbf{W}_e$, the mechanisms for modeling inter-expert dependencies, and the routing strategy are described in the following subsections.
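The aggregation in Eq.~\eqref{eq:cascade_formulation} can be sketched as follows; this is a schematic with assumed names (\texttt{expert\_updates}, \texttt{routing\_weights}), not our released implementation.
\begin{verbatim}
import torch

def cascade_forward(x, W0, expert_updates, routing_weights):
    # expert_updates: list of E update matrices dW_e (m x n)
    # routing_weights: tensor of E non-negative weights w_e(x)
    y = W0 @ x                           # frozen base transformation
    for w_e, dW_e in zip(routing_weights, expert_updates):
        y = y + w_e * (dW_e @ x)         # weighted expert contribution
    return y
\end{verbatim}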
|
|
|
|
\subsection{Heterogeneous Domain Experts}
|
|
As discussed, weight updates in LLM adaptation exhibit heterogeneous structural characteristics.
|
|
To capture this heterogeneity, CASCADE introduces domain-specific experts that impose distinct inductive biases through different parameterizations of $\Delta \mathbf{W}_e$.
|
|
This design enables complementary modeling of diverse update patterns, alleviating the limitations imposed by a single structural assumption.
|
|
|
|
|
|
\subsubsection{Low-Frequency Expert via Discrete Cosine Transform}
|
|
The low-frequency expert is designed to capture global and smooth update patterns that span large regions of the weight matrix.
|
|
Such patterns commonly arise from semantic alignment or global reasoning adjustments and are inefficient to represent using localized or sparse parameterizations.
|
|
|
|
To introduce a global smoothness prior, we parameterize the update in the Discrete Cosine Transform (DCT) domain.
|
|
Let $\mathbf{S}_{\text{dct}} \in \mathbb{R}^{m \times n}$ denote a DCT-domain coefficient matrix, with the same dimensions as the corresponding weight matrix.
|
|
We restrict learning to a predefined low-frequency index set $\mathcal{I}_{\text{dct}}$ and define the sparse spectral parameterization as
|
|
\begin{equation}
|
|
\mathbf{S}_{\text{dct}}[i,j] =
|
|
\begin{cases}
|
|
s_k, & (i,j) \in \mathcal{I}_{\text{dct}}, \\
|
|
0, & \text{otherwise},
|
|
\end{cases}
|
|
\label{eq:dct_sparse}
|
|
\end{equation}
|
|
where $\{s_k\}_{k=1}^{K_{\text{dct}}}$ are trainable parameters associated with fixed low-frequency locations $(i_k,j_k)\in\mathcal{I}_{\text{dct}}$.
|
|
The index set $\mathcal{I}_{\text{dct}}$ is obtained by selecting the $K_{\text{dct}}$ locations with the smallest Manhattan distance to the zero-frequency index $(0,0)$, thereby favoring slowly varying spatial patterns.
|
|
|
|
|
|
The corresponding spatial-domain update produced by the low-frequency expert is obtained via the inverse DCT:
|
|
\begin{equation}
|
|
\Delta \mathbf{W}_{\text{dct}} = \mathrm{IDCT}(\mathbf{S}_{\text{dct}}).
|
|
\label{eq:dct_inverse}
|
|
\end{equation}
|
|
|
|
By restricting learning to low-frequency coefficients, this expert enforces a global smoothness prior on $\Delta \mathbf{W}_{\text{dct}}$, enabling efficient modeling of large-scale structural adjustments with a compact parameterization.
|
|
As such, it serves as the global backbone of the adaptation.
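A minimal NumPy sketch of Eqs.~\eqref{eq:dct_sparse} and~\eqref{eq:dct_inverse} is given below; the sizes, coefficient count, and initialization are illustrative assumptions rather than our actual configuration.
\begin{verbatim}
import numpy as np
from scipy.fft import idctn

m, n, K = 32, 32, 64                  # illustrative sizes and K_dct

# Select the K indices with the smallest Manhattan distance to (0, 0).
ii, jj = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
order = np.argsort((ii + jj).ravel(), kind="stable")[:K]
idx = np.unravel_index(order, (m, n))

S = np.zeros((m, n))                  # sparse DCT-domain coefficient matrix
S[idx] = np.random.randn(K) * 0.01    # trainable coefficients s_k
dW_dct = idctn(S, norm="ortho")       # global, smooth spatial update
\end{verbatim}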
|
|
|
|
\subsubsection{High-Frequency Expert via Wavelet Details}
|
|
|
|
|
|
While the low-frequency expert captures global structure, effective adaptation also requires localized, fine-grained high-frequency corrections that global frequency bases fail to capture effectively.
|
|
To model such patterns, the high-frequency expert parameterizes updates in the wavelet domain, which provides localization in both spatial and frequency domains.
|
|
|
|
We adopt a single-level two-dimensional Haar wavelet basis, which defines four wavelet subbands: one low-frequency approximation subband ($\mathbf{LL}$) and three detail subbands ($\mathbf{LH}$, $\mathbf{HL}$, and $\mathbf{HH}$) corresponding to high-frequency components along different spatial directions.
|
|
To focus on localized refinements, we discard the approximation component and parameterize only the detail subbands.
|
|
|
|
Let $\mathcal{B}=\{\mathrm{LH},\mathrm{HL},\mathrm{HH}\}$ denote the set of detail subbands.
|
|
For each $b\in\mathcal{B}$, we learn a sparse coefficient matrix $\mathbf{B}_b$ defined on a fixed index set $\mathcal{I}_b$, which is randomly sampled once and kept constant during training:
|
|
\begin{equation}
|
|
\mathbf{B}_b[i,j] =
|
|
\begin{cases}
|
|
s^{(b)}_k, & (i,j)\in\mathcal{I}_b, \\
|
|
0, & \text{otherwise}.
|
|
\end{cases}
|
|
\label{eq:wavelet_sparse}
|
|
\end{equation}
|
|
|
|
The spatial-domain update is reconstructed via the inverse Haar transform from the detail subband coefficients:
|
|
\begin{equation}
|
|
\Delta \mathbf{W}_{\text{wav}} =
|
|
\mathrm{IHaar}\!\left(
|
|
\mathbf{0},\,
|
|
\mathbf{B}_{\mathrm{LH}},\,
|
|
\mathbf{B}_{\mathrm{HL}},\,
|
|
\mathbf{B}_{\mathrm{HH}}
|
|
\right).
|
|
\label{eq:wavelet_inverse}
|
|
\end{equation}
|
|
|
|
By restricting learning to sparse detail coefficients, the wavelet expert provides a dedicated mechanism for fine-grained corrections that complements the global updates modeled in the DCT expert, naturally motivating explicit coarse-to-fine coordination across frequency components.
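For illustration, the reconstruction in Eqs.~\eqref{eq:wavelet_sparse} and~\eqref{eq:wavelet_inverse} can be sketched with PyWavelets as follows; sizes and sparsity levels are assumed, and the mapping of $\mathrm{LH}/\mathrm{HL}/\mathrm{HH}$ onto PyWavelets' (horizontal, vertical, diagonal) detail slots is one possible convention.
\begin{verbatim}
import numpy as np
import pywt

m, n, k = 32, 32, 30            # illustrative sizes; k coefficients per subband
sub = (m // 2, n // 2)          # single-level Haar subband shape
rng = np.random.default_rng(0)

def sparse_subband(shape, k):
    # Fixed, randomly sampled index set I_b with trainable values.
    B = np.zeros(shape)
    flat = rng.choice(shape[0] * shape[1], size=k, replace=False)
    B[np.unravel_index(flat, shape)] = rng.normal(scale=0.01, size=k)
    return B

B_LH, B_HL, B_HH = (sparse_subband(sub, k) for _ in range(3))
cA = np.zeros(sub)              # approximation subband discarded (set to zero)

dW_wav = pywt.idwt2((cA, (B_LH, B_HL, B_HH)), "haar")  # inverse Haar transform
\end{verbatim}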
|
|
|
|
\subsubsection{Spatial Residual Expert}
|
|
|
|
Although frequency-domain parameterizations impose useful structural priors, they may fail to capture certain irregular update patterns that are not well represented by predefined spectral bases.
|
|
To account for such out-of-basis effects, CASCADE includes a lightweight spatial residual expert that directly operates in the original parameter space by parameterizing a residual update using a low-rank factorization:
|
|
\begin{equation}
|
|
\Delta \mathbf{W}_{\text{spatial}} = \mathbf{B}\mathbf{A},
|
|
\label{eq:spatial_update}
|
|
\end{equation}
|
|
where $\mathbf{A}\in\mathbb{R}^{r\times n}$ and $\mathbf{B}\in\mathbb{R}^{m\times r}$ with a small rank $r$.
|
|
This formulation provides flexible capacity for modeling update patterns that are difficult to express in the frequency domain.
|
|
|
|
The spatial expert serves as a residual component for out-of-basis corrections, allowing frequency-domain experts to focus on structured global and local patterns while improving robustness and expressive completeness.
|
|
|
|
|
|
\subsection{Cascaded Spectral Modulation}
|
|
|
|
The heterogeneous experts introduced above capture complementary aspects of weight updates.
|
|
However, treating global and local updates as independent components ignores their inherent dependency, as localized refinements in practice are often guided by a global structure.
|
|
To explicitly model this coarse-to-fine relationship, CASCADE introduces a cascaded spectral modulation mechanism that enforces consistency between low-frequency structure and high-frequency updates.
|
|
Specifically, we construct a fixed-dimensional conditioning vector $\mathbf{z}$ by flattening the learned low-frequency DCT coefficients. This vector summarizes the global adaptation pattern and is used as the input to a conditioning network:
|
|
|
|
\begin{equation}
|
|
(\gamma_b, \beta_b)_{b\in\mathcal{B}} = g(\mathbf{z}),
|
|
\label{eq:film_params}
|
|
\end{equation}
|
|
where $g(\cdot)$ denotes a lightweight multilayer perceptron that outputs band-wise scalar modulation parameters, and $\mathcal{B}=\{\mathrm{LH},\mathrm{HL},\mathrm{HH}\}$ indexes the wavelet detail subbands, to which the modulation is applied:
|
|
\begin{equation}
|
|
\tilde{\mathbf{B}}_b = (1 + \gamma_b)\,\mathbf{B}_b + \beta_b,
|
|
\quad b\in\mathcal{B},
|
|
\label{eq:bandwise_film}
|
|
\end{equation}
|
|
where $\gamma_b$ and $\beta_b$ are scalar parameters shared across all locations within subband $b$. The modulation is applied only to the sampled coefficient locations in $\mathcal{I}_b$, with all other entries remaining zero, and the resulting coefficients are used to reconstruct the high-frequency update via Eq.~\eqref{eq:wavelet_inverse}.
|
|
|
|
This design establishes an explicit coarse-to-fine dependency, allowing global low-frequency structure to guide localized refinements and yield more coherent weight updates.
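The modulation in Eqs.~\eqref{eq:film_params} and~\eqref{eq:bandwise_film} admits a compact sketch; the MLP width, coefficient counts, and sampled index masks below are illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

K_dct, n_bands, h = 64, 3, 16            # illustrative sizes
sub = (16, 16)                           # detail-subband shape

s_dct = torch.randn(K_dct) * 0.01        # learned low-frequency coefficients
mask = [(torch.rand(*sub) < 0.1).float() for _ in range(n_bands)]  # index sets I_b
B = [torch.randn(*sub) * 0.01 * mk for mk in mask]  # sparse detail coefficients

# g(.): lightweight MLP mapping the conditioning vector z to band-wise scalars.
g = nn.Sequential(nn.Linear(K_dct, h), nn.GELU(), nn.Linear(h, 2 * n_bands))
gamma, beta = g(s_dct).chunk(2)          # (gamma_b, beta_b) for LH, HL, HH

# Modulate only the sampled coefficient locations; other entries stay zero.
B_tilde = [(1 + gamma[b]) * B[b] + beta[b] * mask[b] for b in range(n_bands)]
\end{verbatim}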
|
|
|
|
|
|
\subsection{Spectral Complexity-Aware Routing}
|
|
|
|
While cascaded spectral modulation defines how different update components are coupled, the relative importance of these components can vary across inputs.
|
|
Some inputs primarily require global structural adaptation, while others benefit more from localized or residual corrections.
|
|
To account for this variability, CASCADE employs a spectral complexity-aware routing mechanism that dynamically combines expert outputs based on input characteristics.
|
|
|
|
Given the input activation to a linear layer, we obtain a sequence-level representation $\bar{\mathbf{x}}$ via pooling.
|
|
From this representation, we extract two complementary types of routing features.
|
|
First, lightweight spectral statistics are computed to characterize the degree of variation and oscillation in the input, forming a spectral feature vector $\bar{\mathbf{x}}_{\text{spec}}$.
|
|
Second, a semantic feature is obtained through a learnable linear projection of $\bar{\mathbf{x}}$ to provide higher-level contextual information.
|
|
The two complementary features are fused through linear projections:
|
|
\begin{equation}
|
|
\mathbf{h} = \mathbf{W}_{\text{spec}} \bar{\mathbf{x}}_{\text{spec}} + \mathbf{W}_{\text{sem}} \bar{\mathbf{x}},
|
|
\label{eq:feature_fusion}
|
|
\end{equation}
|
|
and mapped to expert weights via a softmax:
|
|
\begin{equation}
|
|
\mathbf{w} = \mathrm{softmax}(\mathbf{W}_{\text{out}} \mathbf{h}),
|
|
\label{eq:routing_weights}
|
|
\end{equation}
|
|
where $\mathbf{w}\in\mathbb{R}^{E}$ assigns a non-negative weight to each expert.
|
|
|
|
By leveraging coarse spectral cues and semantic context, the routing mechanism adaptively weights expert contributions while preserving efficient and stable soft combination.
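A minimal sketch of Eqs.~\eqref{eq:feature_fusion} and~\eqref{eq:routing_weights} follows; the particular spectral statistics (first-difference energy and a sign-change rate) are one possible instantiation of $\bar{\mathbf{x}}_{\text{spec}}$, assumed here for illustration.
\begin{verbatim}
import torch
import torch.nn as nn

d, d_spec, d_h, E = 128, 2, 32, 3       # illustrative dimensions
W_spec = nn.Linear(d_spec, d_h, bias=False)
W_sem = nn.Linear(d, d_h, bias=False)
W_out = nn.Linear(d_h, E, bias=False)

x = torch.randn(10, d)                  # token activations of one sequence
x_bar = x.mean(dim=0)                   # sequence-level pooling

diff = x_bar[1:] - x_bar[:-1]           # variation along the feature axis
x_spec = torch.stack([diff.pow(2).mean(),                         # energy
                      (diff[1:] * diff[:-1] < 0).float().mean()]) # oscillation

h = W_spec(x_spec) + W_sem(x_bar)       # fused routing features
w = torch.softmax(W_out(h), dim=-1)     # non-negative expert weights
\end{verbatim}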
|
|
|
|
\subsection{Training Details}
|
|
|
|
CASCADE is trained end-to-end under the standard supervised objective for the downstream task, while keeping the backbone frozen.
|
|
The overall training objective consists of the task loss and two auxiliary regularization terms:
|
|
\begin{equation}
|
|
\mathcal{L}
|
|
=
|
|
\mathcal{L}_{\text{task}}
|
|
+
|
|
\lambda_{\text{bal}} \mathcal{L}_{\text{bal}}
|
|
+
|
|
\lambda_{\text{orth}} \mathcal{L}_{\text{orth}},
|
|
\label{eq:training_objective}
|
|
\end{equation}
|
|
where $\lambda_{\text{bal}}$ and $\lambda_{\text{orth}}$ control the strength of the regularizers.
|
|
|
|
\paragraph{Routing Regularization.}
|
|
To prevent degenerate routing solutions, we introduce a load-balancing regularization, which is defined as
|
|
\begin{equation}
|
|
\mathcal{L}_{\text{bal}}
|
|
=
|
|
E \sum_{e=1}^{E}
|
|
\left(
|
|
\frac{1}{N} \sum_{i=1}^{N} w_e^{(i)}
|
|
\right)^2,
|
|
\label{eq:load_balance}
|
|
\end{equation}
|
|
where $w_e^{(i)}$ denotes the routing weight of expert $e$ for the $i$-th sample in a batch of size $N$, and $E$ is the number of experts.
|
|
|
|
\paragraph{Spectral Orthogonality.}
|
|
To reduce redundancy between frequency-domain experts, we impose an orthogonality regularization on their spectral parameters.
|
|
Specifically, we penalize the inner product between the low-frequency spectral coefficients and high-frequency wavelet detail coefficients.
|
|
Both spectral representations are first mapped to a common latent space with matched dimensionality, over which the inner product is computed:
|
|
\begin{equation}
|
|
\mathcal{L}_{\text{orth}}
|
|
=
|
|
\left|
|
|
\left\langle
|
|
\mathrm{vec}(\mathbf{S}_{\text{dct}}),
|
|
\mathrm{vec}([\mathbf{B}_{\mathrm{LH}}, \mathbf{B}_{\mathrm{HL}}, \mathbf{B}_{\mathrm{HH}}])
|
|
\right\rangle
|
|
\right|.
|
|
\label{eq:orth_loss}
|
|
\end{equation}
|
|
This regularization encourages the two experts to capture complementary spectral patterns.
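Both auxiliary terms are straightforward to implement; the sketch below assumes batched routing weights and already-projected spectral parameters of matched dimensionality.
\begin{verbatim}
import torch

def load_balance_loss(w):
    # w: (N, E) routing weights for a batch; Eq. (load_balance)
    E = w.shape[1]
    return E * w.mean(dim=0).pow(2).sum()

def orthogonality_loss(s_dct_proj, b_wav_proj):
    # Inputs are the DCT and wavelet parameters after projection to a
    # common latent space; Eq. (orth_loss)
    return torch.dot(s_dct_proj.flatten(), b_wav_proj.flatten()).abs()
\end{verbatim}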
|
|
|
|
|
|
Algorithm~\ref{alg:cascade} summarizes the overall procedure of CASCADE, including expert-specific update construction, cascaded spectral modulation, and expert routing.
|
|
|
|
\begin{algorithm}[t]
|
|
\caption{CASCADE: Coarse-to-Fine Spectral Cascading}
|
|
\label{alg:cascade}
|
|
\KwIn{Input activation $\mathbf{x}$, frozen weight matrix $\mathbf{W}_0$}
|
|
\KwOut{Adapted output $\mathbf{y}$}
|
|
|
|
Compute base output $\mathbf{y}_0 \leftarrow \mathbf{W}_0 \mathbf{x}$ \\
|
|
|
|
\textbf{Low-frequency expert:} \\
|
|
Construct sparse DCT spectrum $\mathbf{S}_{\text{dct}}$ using Eq.~\eqref{eq:dct_sparse} \\
|
|
Reconstruct global update $\Delta \mathbf{W}_{\text{dct}}$ using Eq.~\eqref{eq:dct_inverse} \\
|
|
|
|
\textbf{High-frequency expert:} \\
|
|
Construct sparse wavelet detail coefficients $\{\mathbf{B}_b\}_{b\in\mathcal{B}}$ using Eq.~\eqref{eq:wavelet_sparse} \\
|
|
Compute modulation parameters $(\gamma_b,\beta_b)_{b\in\mathcal{B}}$ using Eq.~\eqref{eq:film_params} \\
|
|
Apply band-wise modulation $\tilde{\mathbf{B}}_b$ using Eq.~\eqref{eq:bandwise_film} \\
|
|
Reconstruct local update $\Delta \mathbf{W}_{\text{wav}}$ using Eq.~\eqref{eq:wavelet_inverse} \\
|
|
|
|
\textbf{Spatial residual expert:} \\
|
|
Compute residual update $\Delta \mathbf{W}_{\text{spatial}}$ using Eq.~\eqref{eq:spatial_update} \\
|
|
|
|
\textbf{Routing and aggregation:} \\
|
|
Compute expert weights $\mathbf{w}$ using Eq.~\eqref{eq:routing_weights} \\
|
|
Compute aggregated update $\Delta \mathbf{W} \leftarrow \sum_{e=1}^{E} w_e \cdot \Delta \mathbf{W}_e$ \\
|
|
Return $\mathbf{y} \leftarrow \mathbf{y}_0 + \Delta \mathbf{W}\mathbf{x}$
|
|
\end{algorithm}
|
|
|
|
|
|
|
|
|
|
|
|
\begin{table*}[t]
|
|
\centering
|
|
\small
|
|
\caption{Comparison of CASCADE and baselines on Commonsense tasks across three backbones, reported in accuracy (\%), with micro-avg denoting the average performance.
|
|
$\ ^{*}$ indicates statistically significant improvements over the best baseline (two-sided t-test, $p<0.05$).}
|
|
\resizebox{1\linewidth}{!}{
|
|
\renewcommand{\arraystretch}{1.05}
|
|
\begin{tabular}{l|lccccccccc}
|
|
\toprule
|
|
\textbf{Backbone LLM} & \textbf{Method}
|
|
& \textbf{BoolQ} & \textbf{PIQA} & \textbf{SIQA}
|
|
& \textbf{ARC-C} & \textbf{ARC-E} & \textbf{OBQA}
|
|
& \textbf{HellaSwag} & \textbf{WinoGrande}
|
|
& \textbf{micro-avg(\%)$\uparrow$} \\
|
|
\midrule
|
|
\multirow{7}{*}{\textbf{Qwen 3 4B}}
|
|
& LoRA &66.88&82.97&\underline{73.59}&86.86&92.21&\underline{83.60}&85.37&\underline{68.75}&81.27\\
|
|
& AdaLoRA &\underline{67.34}&82.64&73.44&87.03&92.89&82.00&79.99&67.88&78.89\\
|
|
& BONE &66.15&81.61&72.62&85.24&92.55&75.40&78.85&68.11&77.78\\
|
|
& FourierFT &66.57&80.30&73.54&86.01&92.09&82.40&79.59&63.14&78.01\\
|
|
& LoCA &66.85&83.03&72.67&86.95&\underline{93.27}&80.60&84.33&66.69&80.66\\
|
|
& FlyLoRA &66.51&\underline{83.35}&73.54&\underline{87.20}&93.06&78.20&\underline{85.63}&68.35&\underline{81.33}\\
|
|
& \framework (ours)
|
|
&\textbf{67.74}&\textbf{83.46}&\textbf{75.49}&\textbf{87.88}
|
|
&\textbf{93.64}&\textbf{86.40}&\textbf{85.75}&\textbf{71.98}
|
|
&\textbf{82.22}$^{*}$\\
|
|
\midrule
|
|
\multirow{7}{*}{\textbf{LLaMA 3.2 3B}}
|
|
& LoRA &61.41&78.62&66.79&68.26&84.05&70.20&79.49&\underline{56.35}&\underline{74.05}\\
|
|
& AdaLoRA &\underline{61.53}&78.89&67.04&\underline{69.71}&83.63&69.60&79.31&54.78&73.96\\
|
|
& BONE &60.61&76.17&66.53&67.24&79.88&63.20&79.28&50.04&72.61\\
|
|
& FourierFT &60.92&\underline{80.30}&59.47&67.75&82.45&66.40&79.05&50.67&72.68\\
|
|
& LoCA &61.07&78.51&64.12&66.47&82.37&67.20&77.07&55.88&72.31\\
|
|
& FlyLoRA &59.02&78.94&\underline{67.14}&67.58&\underline{84.22}&\underline{71.80}&\underline{79.66}&52.49&73.64\\
|
|
& \framework (ours)
|
|
&\textbf{62.66}&\textbf{80.69}&\textbf{67.40}&\textbf{69.97}
|
|
&\textbf{84.68}&\textbf{73.60}&\textbf{79.94}&\textbf{62.59}
|
|
&\textbf{75.25}$^{*}$\\
|
|
\midrule
|
|
\multirow{7}{*}{\textbf{Gemma 3 4B}}
|
|
& LoRA &64.34&78.07&\underline{70.21}&75.26&\underline{87.37}&75.60&\underline{77.97}&\underline{61.88}&\underline{75.21}\\
|
|
& AdaLoRA &\underline{64.86}&\underline{79.16}&69.91&75.68&86.87&72.00&77.19&61.17&74.84\\
|
|
& BONE &63.67&78.35&69.19&\underline{76.11}&86.95&70.60&73.97&48.22&72.37\\
|
|
& FourierFT &64.22&77.42&68.68&74.32&87.33&72.00&74.49&50.75&72.68\\
|
|
& LoCA &63.52&76.82&68.47&73.29&85.98&68.20&75.06&49.01&72.39\\
|
|
& FlyLoRA &61.59&76.12&67.45&75.34&86.53&\underline{77.60}&77.88&58.72&74.15\\
|
|
& \framework (ours)
|
|
&\textbf{65.81}&\textbf{80.36}&\textbf{73.39}&\textbf{77.39}
|
|
&\textbf{88.97}&\textbf{79.00}&\textbf{78.47}&\textbf{64.09}
|
|
&\textbf{76.59}$^{*}$\\
|
|
\bottomrule
|
|
\end{tabular}
|
|
}
|
|
\label{tab:main_common}
|
|
\vspace{-4px}
|
|
\end{table*}
|
|
|
|
\begin{table}[t]
|
|
\centering
|
|
\small
|
|
\caption{Average Commonsense QA accuracy across Qwen-3 model scales, comparing CASCADE with best PEFT baselines.}
|
|
\resizebox{1\linewidth}{!}{
|
|
\renewcommand{\arraystretch}{1}
|
|
\begin{tabular}{lccc}
|
|
\toprule
|
|
\textbf{Baseline} & \textbf{Qwen 3 0.6B} & \textbf{Qwen 3 1.7B} & \textbf{Qwen 3 4B} \\
|
|
\midrule
|
|
LoRA &\underline{57.50}&\underline{66.25}&81.27 \\
|
|
AdaLoRA &56.50&64.37&78.89 \\
|
|
FlyLoRA &54.37&62.12&\underline{81.33} \\
|
|
\framework (ours)
|
|
&\textbf{58.07}&\textbf{66.75}&\textbf{82.22} \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
}
|
|
\label{tab:scale}
|
|
\vspace{-9px}
|
|
\end{table}
|
|
|
|
\section{Experiments}
|
|
To comprehensively evaluate the performance of our proposed CASCADE, we conduct extensive experiments guided by the following key research questions (RQs):
|
|
|
|
\begin{itemize}[leftmargin=*]
|
|
\item \textbf{RQ1:}
|
|
How does CASCADE compare with representative PEFT baselines across commonsense and arithmetic tasks?
|
|
|
|
\item \textbf{RQ2:}
|
|
How does CASCADE scale across different parameter sizes within the same LLM family?
|
|
|
|
\item \textbf{RQ3:}
|
|
How do individual design components contribute to the performance of CASCADE?
|
|
\item \textbf{RQ4:}
|
|
How do contributions from different frequency experts vary across layers under the routing mechanism?
|
|
\end{itemize}
|
|
|
|
|
|
We first introduce the experimental setup and then systematically address each of the above research questions.
|
|
|
|
|
|
\subsection{Experimental Setup}
|
|
\paragraph{Datasets.}
|
|
Following the setup of LLM-Adapters~\cite{hu2023llm}, we evaluate CASCADE on \textit{Commonsense} and \textit{Arithmetic QA} tasks, using the \textit{Commonsense15K} and \textit{Math10K} datasets constructed from multiple data sources.
|
|
Commonsense performance is evaluated on eight benchmarks: BoolQ~\cite{clark2019boolq}, PIQA~\cite{bisk2020piqa}, SIQA~\cite{sap2019socialiqa}, ARC-Easy\&Challenge~\cite{clark2018think}, OBQA~\cite{mihaylov2018can}, HellaSwag~\cite{zellers2019hellaswag}, and WinoGrande~\cite{sakaguchi2020winogrande},
|
|
while Arithmetic performance is assessed on seven benchmarks: MultiArith~\cite{roy2016solving}, GSM8K~\cite{cobbe2021training}, AddSub~\cite{hosseini2014learning}, AQuA~\cite{ling2017program}, SingleEq~\cite{koncel2015parsing}, SVAMP~\cite{patel2021nlp}, and MAWPS~\cite{koncel2016mawps}.
|
|
Accuracy is reported as the evaluation metric, with additional details provided in the Appendix.
|
|
|
|
\paragraph{Backbone Models.}
|
|
We evaluate our method on three representative pre-trained LLM backbones: \textbf{Qwen3}~\cite{qwen3technicalreport}, \textbf{Gemma 3}~\cite{gemma_2025}, and \textbf{LLaMA 3.2}~\cite{grattafiori2024llama}.
|
|
These models span diverse architectures, enabling a comprehensive evaluation.
|
|
\paragraph{Baseline Methods.}
|
|
We compare our method with a diverse set of PEFT approaches spanning \textbf{low-rank adaptation} (LoRA~\cite{hu2021lora}, AdaLoRA~\cite{zhang2023adalora}, BONE~\cite{kang2024balancing}), \textbf{frequency-domain modeling} (FourierFT~\cite{gao2024parameter}, LoCA~\cite{du2025loca}), and \textbf{MoE-based} designs (FlyLoRA~\cite{zou2025flylora}). All methods are implemented following their original settings.
|
|
|
|
|
|
\paragraph{Implementation Details.}
|
|
All experiments are conducted on NVIDIA GeForce RTX 3090 GPUs, using bfloat16 precision with DeepSpeed for efficient training.
|
|
Key hyperparameters of CASCADE include 20K low-frequency DCT coefficients, 10K wavelet coefficients, a spatial residual expert with rank 48, and load-balancing and orthogonality loss weights both set to 0.01.
|
|
For detailed implementation, please refer to the Appendix and our code for reproducibility\footnote{\codelink}. %
|
|
|
|
|
|
\begin{table*}[t]
|
|
\centering
|
|
\small
|
|
\caption{Comparison of CASCADE and representative PEFT baselines on arithmetic reasoning benchmarks with the Qwen3-4B model, reported in accuracy (\%).
|
|
$\ ^{*}$ indicates statistically significant improvements over the best baseline (two-sided t-test, $p<0.05$).}
|
|
\resizebox{0.94\linewidth}{!}{
|
|
\renewcommand{\arraystretch}{0.95}
|
|
\begin{tabular}{lcccccccc}
|
|
\toprule
|
|
\textbf{Baseline}
|
|
& \textbf{MultiArith} & \textbf{GSM8K} & \textbf{AddSub}
|
|
& \textbf{AQuA} & \textbf{SingleEq}
|
|
& \textbf{SVAMP} & \textbf{MAWPS}
|
|
& \textbf{micro-avg(\%)$\uparrow$} \\
|
|
\midrule
|
|
LoRA &77.50&\underline{36.16}&\underline{83.80}&26.77&85.83&55.90&\underline{79.41}&\underline{58.53}\\
|
|
AdaLoRA &\underline{80.50}&33.81&75.95&22.83&74.41&48.80&74.37&54.01\\
|
|
BONE &79.50&31.69&78.99&\underline{27.17}&80.71&50.30&76.05&54.94\\
|
|
FourierFT &68.67&31.08&76.46&23.62&78.54&\underline{57.30}&74.34&54.02\\
|
|
LoCA &73.33&30.63&72.15&21.65&75.98&48.30&69.33&51.41\\
|
|
FlyLoRA &79.67&35.33&81.52&22.83&\underline{86.42}&56.20&73.11&57.93\\
|
|
\framework (ours)
|
|
&\textbf{81.33}&\textbf{37.00}&\textbf{86.08}&\textbf{27.56}
|
|
&\textbf{87.60}&\textbf{57.90}&\textbf{80.25}&\textbf{60.29}$^{*}$\\
|
|
\bottomrule
|
|
\end{tabular}
|
|
}
|
|
\label{tab:main_arith}
|
|
\vspace{-4px}
|
|
\end{table*}
|
|
|
|
\subsection{Overall Performance (RQ1)}
|
|
To answer RQ1, we compare CASCADE with baselines on two categories of tasks: commonsense and arithmetic QA.
|
|
|
|
As shown in Table~\ref{tab:main_common}, CASCADE consistently achieves the best micro-averaged accuracy across all three backbone models.
|
|
Compared with strong baselines such as LoRA, AdaLoRA, and recent frequency-domain methods, CASCADE yields consistent and statistically significant improvements, demonstrating robust performance across different architectures and commonsense benchmarks.
|
|
These results indicate that jointly modeling heterogeneous update components is more effective than relying on a single structural assumption.
|
|
By capturing both global low-frequency structure and localized high-frequency refinements, CASCADE better adapts to diverse commonsense reasoning patterns.
|
|
|
|
We further evaluate CASCADE on arithmetic reasoning benchmarks using the Qwen3-4B backbone, with results reported in Table~\ref{tab:main_arith}.
|
|
Consistent with the observations on commonsense tasks, CASCADE achieves the highest overall performance, outperforming all baselines in terms of micro-averaged accuracy.
|
|
Together, these results demonstrate that CASCADE provides a more effective PEFT strategy across both commonsense and arithmetic reasoning tasks.
|
|
|
|
\subsection{Scalability Analysis (RQ2)}
|
|
Table~\ref{tab:scale} reports the performance of CASCADE across different parameter scales within the Qwen-3 family.
|
|
CASCADE consistently outperforms the strongest PEFT baselines at all model sizes, from 0.6B to 4B parameters.
|
|
Notably, the performance advantage remains stable as model scale increases, indicating that CASCADE scales favorably with model capacity.
|
|
This trend validates the effectiveness of CASCADE's design, demonstrating that explicitly modeling heterogeneous update components and their coarse-to-fine coordination remains robust across different model scales within the same LLM family.
|
|
|
|
|
|
|
|
\subsection{Ablation and Analysis (RQ3, 4)}
|
|
Fig.~\ref{fig:abla} (left) reports ablation results by removing key components of CASCADE.
|
|
Removing either the DCT or Wavelet expert leads to clear performance degradation, indicating that both global and local update modeling are necessary.
|
|
Disabling cascaded spectral modulation further reduces accuracy, highlighting the importance of explicitly modeling coarse-to-fine dependencies rather than combining experts independently.
|
|
In addition, the spatial residual expert provides consistent gains by compensating for update patterns not well captured in the frequency domain.
|
|
|
|
Fig.~\ref{fig:abla} (right) visualizes the routing weights across layers for different experts.
|
|
Lower layers allocate more weight to the low-frequency (DCT) expert, reflecting a preference for global structural adaptation.
|
|
As depth increases, the routing gradually shifts toward the high-frequency (Wavelet) expert, indicating an increased emphasis on localized and fine-grained refinements.
|
|
This layer-wise trend is consistent with the intended coarse-to-fine adaptation behavior of CASCADE.
|
|
|
|
\begin{figure}[t]
|
|
\centering
|
|
\resizebox{1\linewidth}{!}{%
|
|
\begin{minipage}{\linewidth}
|
|
\centering
|
|
\begin{subfigure}[b]{0.495\linewidth}
|
|
\includegraphics[width=\linewidth]{assets/ablation_main.pdf}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}[b]{0.495\linewidth}
|
|
\includegraphics[width=\linewidth]{assets/router_weights_by_layer.pdf}
|
|
\end{subfigure}
|
|
\end{minipage}
|
|
}
|
|
\caption{Ablation and routing behavior analysis of CASCADE.}
|
|
\label{fig:abla}
|
|
\vspace{-10px}
|
|
\end{figure}
|
|
|
|
\section{Related Work}
|
|
|
|
\paragraph{Parameter-Efficient Fine-Tuning.}
|
|
Parameter-efficient fine-tuning (PEFT) adapts large pretrained models by introducing a small number of task-specific parameters while keeping the backbone frozen~\cite{lialin2023scaling}.
|
|
Representative approaches include adapter-based methods~\cite{pfeiffer2020adapterhub}, prefix tuning~\cite{li2021prefix}, and low-rank adaptation (LoRA)~\cite{hu2021lora}, which models weight updates under a low-rank assumption.
|
|
Subsequent variants improve flexibility via adaptive rank allocation (e.g., AdaLoRA~\cite{zhang2023adalora}), balancing update magnitude and direction (BONE~\cite{kang2024balancing}), or exploring alternative structured parameterizations such as frequency-domain representations (FourierFT~\cite{gao2024parameter}).
|
|
More recently, expert-based PEFT methods incorporate routing or mixtures of multiple adaptation modules to improve task decoupling and specialization (e.g., FlyLoRA~\cite{zou2025flylora}, MoELoRA~\cite{luo2024moelora}).
|
|
Despite their effectiveness, most PEFT methods still rely on a single dominant structural hypothesis for weight updates, which limits their ability to capture heterogeneous adaptation patterns that involve both global and localized refinements.
|
|
|
|
\paragraph{Frequency-Domain and Structured Adaptation.}
|
|
Beyond low-rank factorization, recent work explores parameterizing weight updates in transformed domains~\cite{zhang2025f}.
|
|
Methods like FourierFT represent weight updates in the Fourier domain using global frequency components~\cite{gao2024parameter,shen2024parameter}.
|
|
Wavelet-based approaches adopt multi-resolution representations to capture both global structure and localized variations, and LoCA further incorporates location-aware parameterization on cosine representations to model structured, position-sensitive updates~\cite{hu2025waveletft,du2025loca}.
|
|
By associating low-frequency components with smooth global structure and high-frequency components with localized variations, these methods offer an alternative spectral perspective on adaptation.
|
|
However, most frequency-domain approaches adopt a single transform or scale and treat frequencies independently, failing to model coarse-to-fine interactions and to coordinate global and local refinements.
|
|
|
|
\section{Conclusion}
|
|
In this paper, we presented CASCADE, a PEFT framework that models LLM weight updates through heterogeneous experts across frequency and spatial domains.
|
|
By explicitly decomposing weight updates into global low-frequency structures, localized high-frequency refinements, and residual spatial corrections, CASCADE provides a unified and expressive representation of diverse adaptation behaviors.
|
|
A key contribution of CASCADE is the cascaded spectral modulation mechanism, which establishes an explicit coarse-to-fine dependency between global and local updates, thereby improving the coherence and consistency of the adaptation process.
|
|
In addition, the spectral complexity-aware routing mechanism enables adaptive expert combination.
|
|
Extensive experiments across multiple backbone models, tasks, and model scales demonstrate that CASCADE consistently outperforms existing PEFT methods.
|
|
These results show that explicitly modeling heterogeneous update structures and their dependencies is effective and robust for LLM adaptation.
|
|
|
|
\appendix
|
|
|
|
\section{Experimental Details}
|
|
\label{sec:appendix}
|
|
|
|
\subsection{Training Setup}
|
|
All experiments are conducted on NVIDIA GeForce RTX 3090 GPUs.
|
|
We employ DeepSpeed with ZeRO Stage~2 optimization for memory-efficient training.
|
|
The training configuration is kept consistent across all methods to ensure fair comparison.
|
|
Specifically, we use a per-device batch size of 2 with 2 gradient accumulation steps,
|
|
resulting in an effective batch size of 4.
|
|
The learning rate is set to $1 \times 10^{-4}$ with a cosine learning rate scheduler
|
|
and a warmup ratio of 0.1.
|
|
All models are trained with a maximum sequence length of 2048 tokens.
|
|
All training is performed in \texttt{bfloat16} precision using the FusedAdam optimizer
|
|
with momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.95$.
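For reference, a minimal DeepSpeed configuration mirroring this setup might look as follows; this is an illustrative sketch (the cosine schedule with warmup is configured in the training script), not a verbatim copy of our configuration file.
\begin{verbatim}
ds_config = {
    "train_micro_batch_size_per_gpu": 2,   # per-device batch size
    "gradient_accumulation_steps": 2,      # effective batch size of 4
    "bf16": {"enabled": True},             # bfloat16 training
    "zero_optimization": {"stage": 2},     # ZeRO Stage 2
    "optimizer": {
        "type": "Adam",                    # DeepSpeed's fused Adam kernel
        "params": {"lr": 1e-4, "betas": [0.9, 0.95]},
    },
}
\end{verbatim}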
|
|
|
|
|
|
\subsection{Software and Environment}
|
|
|
|
The experiments were conducted using the following software packages and versions for reproducibility:
|
|
|
|
\begin{itemize}
|
|
\item torch==2.1.2
|
|
\item deepspeed==0.12.6
|
|
\item numpy==1.26.4
|
|
\item peft==0.16.0
|
|
\item transformers==4.47.1
|
|
\item tokenizers==0.21.2
|
|
\item CUDA==12.1
|
|
\end{itemize}
|
|
|
|
The hardware environment configuration is as follows:
|
|
|
|
\begin{itemize}[leftmargin=*]
|
|
\item OS: Ubuntu 20.04 LTS
|
|
\item CPU: Intel Xeon Silver 4214R
|
|
\item GPU: NVIDIA GeForce RTX 3090
|
|
\item Memory: 512GB RAM
|
|
\end{itemize}
|
|
Detailed implementation and datasets can be found in our codebase\footnote{\codelink}.
|
|
|
|
|
|
\subsection{CASCADE Configuration}
|
|
CASCADE applies heterogeneous frequency-domain and spatial-domain experts
|
|
to all linear projection layers in the Transformer architecture.
|
|
The low-frequency expert uses 20,000 DCT coefficients selected by Manhattan
|
|
distance from the DC component.
|
|
The high-frequency expert adopts a single-level 2D Haar wavelet transform
|
|
with a total of 10,000 learnable coefficients distributed across the three
|
|
detail subbands (LH, HL, HH), while the approximation subband is fixed to zero.
|
|
The spatial residual expert is parameterized as a low-rank adapter with rank $r=48$.
|
|
A lightweight routing module produces input-dependent expert weights.
|
|
Auxiliary regularization includes load-balancing and spectral orthogonality losses,
|
|
both weighted by 0.01.
|
|
|
|
\subsection{Baselines}
|
|
We compare CASCADE with representative parameter-efficient fine-tuning methods
|
|
spanning low-rank adaptation, frequency-domain parameterization,
|
|
and mixture-of-experts approaches.
|
|
These include LoRA, AdaLoRA, BONE, FourierFT, LoCA, and FlyLoRA.
|
|
All baseline methods apply adapters to linear layers following the same configuration
|
|
as CASCADE.
|
|
|
|
\section{Evaluation Protocol and Metrics}
|
|
|
|
\subsection{Generation Procedure}
|
|
All model outputs are generated using auto-regressive decoding via the \texttt{generate()} API in Hugging Face Transformers.
|
|
We employ greedy decoding~(\texttt{do\_sample=False}), and set a maximum of 256 new tokens~(\texttt{max\_new\_tokens=256}).
|
|
|
|
Each input follows a unified instruction template, as shown below:
|
|
\begin{tcolorbox}[boxrule=0.8pt]
|
|
\textless s\textgreater Below is an instruction that describes a task. Write a response that appropriately completes the request.
|
|
|
|
\#\#\# Instruction:\\
|
|
\{instruction\}
|
|
\\
|
|
\\
|
|
\#\#\# Response:
|
|
\end{tcolorbox}
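Concretely, decoding follows the standard Hugging Face pattern sketched below; the model identifier and instruction text are placeholders for illustration.
\begin{verbatim}
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")      # example backbone
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

prompt = ("Below is an instruction that describes a task. Write a response "
          "that appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n"
          "### Response:\n").format(instruction="...")

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, do_sample=False, max_new_tokens=256)
text = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                  skip_special_tokens=True)
\end{verbatim}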
|
|
|
|
\subsection{Answer Extraction and Accuracy Calculation}
|
|
Results are calculated based on extracted predictions from generated outputs using task-specific regular expressions:
|
|
|
|
\begin{itemize}[leftmargin=*]
|
|
\item \textit{Commonsense QA:} Extracted exact-match answers (true/false, or the solution/answer/ending options) and computed accuracy by direct matching against ground-truth labels.
|
|
\item \textit{Arithmetic QA:} Extracted numerical answers from the output text (matched with an absolute tolerance of $10^{-3}$) or alphabetic choices (A--E) for the AQuA dataset; a simplified sketch follows this list.
|
|
\end{itemize}
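The numeric-extraction rule can be sketched as follows; the regular expression and last-number heuristic are simplified stand-ins for the task-specific patterns in our codebase.
\begin{verbatim}
import re

def extract_number(text):
    # Take the last number in the generated response as the prediction.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def is_correct(pred, gold, tol=1e-3):
    # Numeric match with absolute tolerance of 1e-3.
    return pred is not None and abs(pred - gold) <= tol
\end{verbatim}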
|
|
|
|
All extraction and accuracy computation scripts are provided for reproducibility in our codebase.
|
|
|
|
|
|
\section{Dataset Details}
|
|
|
|
\subsection{Training Datasets}
|
|
We utilize two unified instruction-tuning datasets provided by LLM-Adapters~\cite{hu2023llm}:
|
|
\begin{itemize}[leftmargin=*, topsep=0pt]
|
|
\item \textbf{Commonsense15K} covers a wide range of commonsense reasoning questions. All examples are template-normalized into a consistent instruction format, supporting robust cross-task generalization.
|
|
\item \textbf{Math10K} comprises diverse math word problems, each annotated with a step-by-step chain-of-thought solution and a final answer, enabling thorough evaluation of arithmetic reasoning under instruction-following settings.
|
|
\end{itemize}
|
|
The summary of dataset statistics is provided in Table~\ref{tab:dataset}.
|
|
|
|
\begin{table}[t]
|
|
\centering
|
|
\small
|
|
\resizebox{0.95\linewidth}{!}{
|
|
\renewcommand{\arraystretch}{1.01}
|
|
\begin{tabular}{lccc}
|
|
\toprule
|
|
\textbf{Dataset} & \textbf{Samples} & \textbf{Total Tokens} & \textbf{Avg. Tokens/Sample} \\
|
|
\midrule
|
|
Commonsense15K & 15,119 & 1,778,782 & 117.65 \\
|
|
Math10K & 9,919 & 2,273,016 & 229.16 \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
}
|
|
\caption{Statistics of the training datasets for commonsense and arithmetic QA tasks.
|
|
}
|
|
\label{tab:dataset}
|
|
\end{table}
|
|
|
|
|
|
|
|
\noindent \textbf{a) Commonsense QA:}
|
|
\begin{itemize}[leftmargin=1em]
|
|
\item \textbf{BoolQ}~\cite{clark2019boolq}: BoolQ is a yes/no question answering dataset featuring naturally occurring, information-seeking queries and passage-based inference.
|
|
\item \textbf{PIQA}~\cite{bisk2020piqa}: PIQA is a benchmark for physical commonsense reasoning, focused on practical everyday tasks with two candidate solutions.
|
|
\item \textbf{SIQA}~\cite{sap2019socialiqa}: Social IQa is a multiple-choice benchmark that tests social and emotional commonsense reasoning in daily situations.
|
|
\item \textbf{ARC-Challenge / ARC-Easy}~\cite{clark2018think}: The AI2 Reasoning Challenge (ARC) is a science question answering benchmark consisting of grade-school level, multiple-choice questions divided into Easy and Challenge subsets by difficulty.
|
|
\item \textbf{OBQA}~\cite{mihaylov2018can}: OpenBookQA is a science question answering benchmark requiring multi-step reasoning over a provided set of core science facts.
|
|
\item \textbf{HellaSwag}~\cite{zellers2019hellaswag}: HellaSwag is a natural language inference benchmark with adversarially-filtered continuations requiring robust commonsense reasoning.
|
|
\item \textbf{WinoGrande}~\cite{sakaguchi2020winogrande}: WinoGrande is a binary fill-in-the-blank pronoun resolution benchmark designed to require advanced commonsense reasoning.
|
|
\end{itemize}
|
|
\begin{table}[t]
|
|
\centering
|
|
\small
|
|
\resizebox{1\linewidth}{!}{
|
|
\renewcommand{\arraystretch}{1.01}
|
|
\begin{tabular}{lcc}
|
|
\toprule
|
|
\textbf{Dataset} & \textbf{Samples} & \textbf{Answer Format} \\
|
|
\midrule
|
|
BoolQ & 3,270 & true / false \\
|
|
PIQA & 1,838 & solution1 / solution2 \\
|
|
SIQA & 1,954 & answer1 / answer2 / answer3 \\
|
|
ARC-Challenge & 1,172 & answer1 / answer2 / answer3 / answer4 \\
|
|
ARC-Easy & 2,376 & answer1 / answer2 / answer3 / answer4 \\
|
|
OBQA & 500 & answer1 / answer2 / answer3 / answer4 \\
|
|
HellaSwag & 10,042 & ending1 / ending2 / ending3 / ending4 \\
|
|
WinoGrande & 1,267 & option1 / option2 \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
}
|
|
\caption{Statistics of Commonsense QA Test Datasets.}
|
|
\label{tab:commonsense-datasets}
|
|
\end{table}
|
|
|
|
\noindent \textbf{b) Arithmetic QA:}
|
|
\begin{itemize}[leftmargin=1em]
|
|
\item \textbf{MultiArith}~\cite{roy2016solving}: MultiArith contains multi-step arithmetic word problems to evaluate a system's ability to handle complex reasoning chains.
|
|
\item \textbf{GSM8K}~\cite{cobbe2021training}: GSM8K is a dataset of linguistically diverse grade school math word problems, designed for benchmarking multi-step arithmetic reasoning with natural language solutions.
|
|
\item \textbf{AddSub}~\cite{hosseini2014learning}: AddSub is a corpus of short word problems focused exclusively on addition and subtraction, used to assess basic arithmetic reasoning capabilities.
|
|
\item \textbf{AQuA}~\cite{ling2017program}: AQuA is a large-scale dataset of algebraic word problems, each paired with natural language rationales to support step-by-step reasoning.
|
|
\item \textbf{SingleEq}~\cite{koncel2015parsing}: SingleEq is a collection of multi-sentence algebraic word problems, emphasizing equation tree parsing and formal reasoning.
|
|
\item \textbf{SVAMP}~\cite{patel2021nlp}: SVAMP is a challenge set constructed from elementary math word problems, aimed at evaluating a model's robustness to question sensitivity, structural variations, and reasoning challenges.
|
|
\item \textbf{MAWPS}~\cite{koncel2016mawps}: MAWPS is a repository of math word problems drawn from multiple sources, offering a unified benchmark for evaluating models.
|
|
\end{itemize}
|
|
|
|
\begin{table}[t]
|
|
\centering
|
|
\small
|
|
\resizebox{0.8\linewidth}{!}{
|
|
\renewcommand{\arraystretch}{0.95}
|
|
\begin{tabular}{lcc}
|
|
\toprule
|
|
\textbf{Dataset} & \textbf{Samples} & \textbf{Answer Type} \\
|
|
\midrule
|
|
MultiArith & 600 & Numeric \\
|
|
GSM8K & 1,319 & Numeric \\
|
|
AddSub & 395 & Numeric \\
|
|
AQuA & 254 & Multiple Choice (A--E) \\
|
|
SingleEq & 508 & Numeric \\
|
|
SVAMP & 1,000 & Numeric \\
|
|
MAWPS & 238 & Numeric \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
}
|
|
\caption{Statistics of Arithmetic QA Test Datasets.}
|
|
\label{tab:arith-datasets}
|
|
\end{table}
|
|
|
|
|
|
|
|
\subsection{Evaluation Benchmarks}
|
|
We evaluate model performance on a suite of well-established commonsense and arithmetic QA benchmarks, enabling comprehensive evaluation of both generalization and robustness.
|
|
Detailed statistics for all evaluation datasets can be found in Table~\ref{tab:commonsense-datasets}~(Commonsense) and Table \ref{tab:arith-datasets}~(Arithmetic).
|
|
|