\title[AgentCity: An AI-Maintained Continuous Benchmark for Traffic Prediction]{AgentCity: An AI-Maintained Continuous Benchmark \\ for Traffic Prediction}
\input{misc}
\section{Introduction}
\begin{figure}[t]
\centering
\includegraphics[width=1.\linewidth]{assets/Agent_Promo_NG.png}
\caption{AgentCity: A multi-agent system for continuous traffic prediction benchmarking.}
\label{fig:placeholder}
\end{figure}
Traffic prediction is a fundamental component of data-driven intelligent transportation systems, supporting a wide range of applications such as traffic management, route planning, mobility analysis, and urban decision-making.
In recent years, advances in deep learning have led to a rapid growth of traffic prediction models, covering diverse tasks including traffic state prediction, trajectory forecasting, travel time estimation, and map matching. These models vary substantially in architectural design, modeling assumptions, data requirements, and evaluation settings.
Consequently, benchmarks play a critical role in enabling systematic evaluation. By providing standardized datasets, clearly defined tasks, and consistent evaluation protocols, benchmarks allow fair and reproducible comparison of model performance across studies, and support empirical analysis within the research community. To this end, several benchmarking frameworks have been proposed for traffic and spatiotemporal prediction. Representative examples include DL-Traff~\cite{Dl-traff}, LibCity~\cite{Libcity}, and TorchSpatial~\cite{Torchspatial}, which aim to standardize data preprocessing, task definitions, and evaluation pipelines across a range of prediction tasks. These efforts establish a more consistent basis for empirical comparison.
However, existing benchmarks share a fundamental limitation: they rely on \textbf{manual, human-centered maintenance}, which introduces several structural challenges.
First, \textbf{\emph{limited scalability}} constrains benchmark coverage. The traffic prediction literature continues to expand rapidly, with a large number of new models published each year. These models are implemented using diverse frameworks, code structures, and data interfaces, making their manual integration into a unified benchmark labor-intensive and difficult to sustain at scale. As a result, benchmark coverage often lags behind recent research progress.
Second, \textbf{\emph{static evaluation pipelines}} limit continuous assessment. Most existing benchmarks are built upon fixed datasets and evaluation procedures, whereas real-world transportation systems evolve continuously, with changes in road networks, travel demand, and mobility patterns. Although some datasets are periodically updated, incorporating these updates into existing benchmarks typically requires additional manual effort, limiting long-term and continuous evaluation.
Third, \textbf{\emph{inconsistent evaluation settings}} weaken result comparability. Results reported in original papers are often obtained under carefully tuned configurations tailored to specific datasets and tasks, while benchmark implementations typically rely on default or minimally tuned settings. This discrepancy can lead to deviations from reported results and reduces the benchmark's reliability as a fair reference for model assessment.
Together, these challenges indicate that a key limitation of traffic prediction benchmarking is no longer the absence of standardized frameworks, but the lack of a \emph{continuous}, \emph{scalable}, and \emph{consistently evaluated} maintenance mechanism that treats benchmark construction as an ongoing process rather than a one-time effort.
In this work, we propose \textbf{AgentCity}, an \textbf{AI-maintained} framework for the continuous construction and evaluation of traffic prediction benchmarks.
AgentCity replaces manual, human-centered benchmark maintenance with an automated pipeline that systematically retrieves recent literature, integrates external model and dataset implementations, and evaluates models under unified and consistent protocols.
AgentCity structures benchmark maintenance as a coordinated workflow consisting of three core components: \emph{literature retrieval}, \emph{model and data integration}, and \emph{standardized evaluation}.
These components respectively support the automated discovery of relevant studies, the reproduction and integration of external models and datasets into a unified evaluation framework, and the fair assessment of models under consistent data processing, training, and evaluation settings.
Within this process, controlled hyperparameter tuning is applied to each model on each task under a unified protocol, ensuring fair and comparable evaluation.
The overall workflow is coordinated by a multi-agent system, enabling scalable and robust benchmark maintenance over time.
Built upon AgentCity, we construct a continuously evolving traffic prediction benchmark that currently aggregates 74 representative models across multiple tasks and datasets.
All models are evaluated using unified evaluation protocols, enabling reproducible and comparable assessment across methods.
The AgentCity framework and benchmark are publicly available online, with configurations and evaluation results for reproducibility.
Our main contributions are summarized as follows:
\begin{itemize}[leftmargin=*, topsep=0pt]
\item We propose \textbf{AgentCity}, the first \textbf{AI-maintained} framework designed for continuous construction and evaluation of \textbf{traffic prediction benchmarks}.
\item We develop a multi-agent workflow that automates key benchmark maintenance processes, including literature retrieval, model and data integration, and standardized evaluation.
\item We release a large-scale, continuously updated traffic prediction benchmark and public leaderboard built upon AgentCity, supporting reproducible evaluation across tasks and datasets.
\end{itemize}
\begin{table*}[ht]
\centering
\caption{Categorization of traffic-related data and their typical representations.}
\label{tab:st_data_abstraction}
\resizebox{0.9\linewidth}{!}{
\begin{tabular}{c c c c}
\toprule
\textbf{Data Group} &
\textbf{Data Category} &
\textbf{Description} &
\textbf{Typical Data Form} \\
\midrule
\multirow{2}{*}{Static Spatial Structure}
& Geographical Units &
Geographical entities defining the spatial domain. &
$N \times D$ \\
& Unit Relations &
Structured relations between spatial units. &
$N \times N$ \\
\midrule
\multirow{3}{*}{Group-level Spatiotemporal Dynamics}
& Unit-level Dynamics &
Time-varying attributes defined on spatial units. &
$T \times N \times D$ \\
& Grid-level Dynamics &
Time-varying attributes defined on spatial regions. &
$T \times I \times J \times D$ \\
& Origin--Destination Dynamics &
Time-varying interactions between spatial unit pairs. &
$T \times N \times N \times D$ \\
\midrule
Individual Trajectory Dynamics
& Trajectory Data &
Ordered temporal sequences of spatial states. &
$\{(x_i, t_i)\}_{i=1}^{L}$ \\
\bottomrule
\end{tabular}
}
\end{table*}
\begin{table*}[ht]
\centering
\caption{Categorization of traffic prediction tasks and their input--output data categories.}
\label{tab:task_summary}
\resizebox{\linewidth}{!}{
\begin{tabular}{c c c c}
\toprule
\textbf{Task} &
\textbf{Input Data Category} &
\textbf{Output Data Category} &
\textbf{Typical Data Form} \\
\midrule
Traffic State Prediction &
Group-level Dynamics \,+\, Unit Relations &
Future Unit-level Dynamics &
$X \in \mathbb{R}^{T_{\text{in}} \times N \times D},\;
y \in \mathbb{R}^{T_{\text{out}} \times N \times D}$ \\
\midrule
Trajectory Location Prediction &
Trajectory Data \,+\, Geographical Units &
Next Trajectory Location &
$[loc_1, \ldots, loc_n] \rightarrow loc_{n+1}$ \\
\midrule
ETA Prediction &
Trajectory Data &
Travel Time &
$\{(x_i, t_i)\}_{i=1}^{L} \rightarrow \Delta t$ \\
\midrule
Map Matching &
Trajectory Data \,+\, Geographical Units \,+\, Unit Relations &
Road segment sequence &
$\{(lon_i, lat_i, t_i)\}_{i=1}^{L} \rightarrow \{r_j\}_{j=1}^{K}$ \\
\bottomrule
\end{tabular}
}
\end{table*}
\section{Traffic Prediction Data and Tasks}
\label{sec:background}
This section introduces a unified abstraction of data types and prediction tasks commonly studied in traffic prediction, highlighting the diversity of data organizations and task interfaces that characterize existing traffic prediction benchmarks.
\subsection{Traffic-Related Data Categories}
Traffic prediction data differ from homogeneous modalities such as images or text by combining spatial entities, relational structures, and time-indexed observations.
In traffic scenarios, these data can be broadly categorized into three groups: static spatial structure, group-level traffic dynamics, and individual trajectory dynamics.
\paratitle{Static Spatial Structure.}
Static spatial structure describes the fixed spatial context of a traffic system.
It includes geographical units that define the spatial domain, such as sensors, road segments, or regions, as well as structured relations between these units, such as network connectivity or adjacency.
This category provides the spatial foundation upon which traffic observations are organized.
\paratitle{Group-level Traffic Dynamics.}
Group-level traffic dynamics capture time-varying attributes defined over spatial units or their relations, including traffic speed, flow, or density measured at sensors or regions.
Such data are usually represented as time-indexed tensors defined on nodes, grids, or origin--destination pairs.
\paratitle{Individual Trajectory Dynamics.}
Individual trajectory dynamics describe fine-grained mobility behavior of individual trips, represented as spatiotemporal state sequences.
Table~\ref{tab:st_data_abstraction} summarizes these data categories and their typical representations.
Throughout this paper, $N$ denotes the number of spatial units, $T$ the number of time steps, $D$ the feature dimension, $I$ and $J$ the numbers of grid rows and columns, and $L$ the trajectory length.
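To make these abstractions concrete, the following minimal Python sketch expresses the three data groups with the shapes defined above; the container names are our own illustration, not AgentCity internals.
\begin{verbatim}
# Illustrative containers for the data categories in
# Table 1; shapes follow the notation N, T, D, I, J, L.
from dataclasses import dataclass
import numpy as np

@dataclass
class StaticSpatialStructure:
    units: np.ndarray       # (N, D) geographical units
    relations: np.ndarray   # (N, N) adjacency/connectivity

@dataclass
class GroupLevelDynamics:
    unit_series: np.ndarray  # (T, N, D) e.g., sensor speeds
    grid_series: np.ndarray  # (T, I, J, D) regional dynamics
    od_series: np.ndarray    # (T, N, N, D) OD interactions

@dataclass
class Trajectory:
    # Ordered sequence {(x_i, t_i)}_{i=1..L}.
    states: np.ndarray      # (L, ...) locations or unit IDs
    timestamps: np.ndarray  # (L,)
\end{verbatim}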
\subsection{Traffic Prediction Tasks}
Based on the data categories above, we consider four representative traffic prediction tasks with different data categories and input--output structures, as summarized in Table~\ref{tab:task_summary}.
\paratitle{Traffic state prediction}
forecasts future traffic dynamics over a fixed set of spatial units.
The input consists of historical group-level dynamics,
$X \in \mathbb{R}^{T_{\text{in}} \times N \times D}$,
and the output is a sequence of future unit-level dynamics,
$y \in \mathbb{R}^{T_{\text{out}} \times N \times D}$.
\paratitle{Trajectory location prediction}
focuses on next-step prediction for individual trajectories.
Given a historical trajectory represented as an ordered sequence of locations $[loc_1, \ldots, loc_n]$, the task predicts the next location $loc_{n+1}$.
The input trajectories are variable in length, and the outputs are discrete spatial states.
\paratitle{Estimated time of arrival (ETA) prediction} aims to estimate the travel duration of a trajectory.
The input is an individual trajectory represented as a sequence of spatiotemporal points
$\{(x_i, t_i)\}_{i=1}^{L}$,
and the output is a scalar value representing the estimated travel time.
\paratitle{Map matching}
aims to infer the most likely network-constrained path that corresponds to an observed trajectory.
Given noisy or sparse trajectory observations, the task outputs an ordered sequence of road segments that is consistent with the underlying network topology.
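The four input--output structures above can be read as a small set of task interfaces; the sketch below is a hedged illustration whose signatures are ours, not the benchmark's actual code.
\begin{verbatim}
# Hedged sketch of the task interfaces implied by Table 2;
# signatures are illustrative only.
from typing import Protocol, Sequence
import numpy as np

class TrafficStatePredictor(Protocol):
    def predict(self, x: np.ndarray) -> np.ndarray:
        """(T_in, N, D) history -> (T_out, N, D) future."""

class TrajectoryLocationPredictor(Protocol):
    def predict_next(self, locs: Sequence[int]) -> int:
        """[loc_1, ..., loc_n] -> loc_{n+1}."""

class ETAPredictor(Protocol):
    def predict_eta(self, traj: np.ndarray) -> float:
        """{(x_i, t_i)}_{i=1..L} -> scalar travel time."""

class MapMatcher(Protocol):
    def match(self, gps: np.ndarray) -> Sequence[int]:
        """{(lon_i, lat_i, t_i)} -> road segments {r_j}."""
\end{verbatim}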
\section{Methodology}
\subsection{Overview}
\label{sec:overview}
AgentCity is a multi-agent framework designed to support the continuous construction and evaluation of traffic prediction benchmarks.
Built on top of LibCity~\cite{Libcity}, AgentCity enables the automated discovery, reproduction, and evaluation of traffic prediction models under unified task definitions and evaluation protocols.
Given user-specified keywords and constraints, the system incrementally identifies relevant studies, integrates their models and associated datasets, and evaluates them in a consistent manner.
As illustrated in Figure~\ref{fig:overview}, AgentCity organizes the overall process into three sequential stages: \emph{Literature Retrieval}, \emph{Model and Data Integration}, and \emph{Standardized Evaluation}.
Each stage is managed by a dedicated \emph{Stage Leader Agent}, which is responsible for planning the stage workflow, coordinating specialized \emph{Subagents}, and validating intermediate results.
Literature Retrieval focuses on identifying relevant models within a controlled search scope.
Model and Data Integration handles the reproduction and adaptation of external model implementations and datasets into unified task interfaces.
Standardized Evaluation assesses all integrated models under consistent data processing, training, and evaluation settings.
To accommodate heterogeneous implementations and incomplete specifications commonly found in research code, AgentCity supports iterative refinement within each stage.
When intermediate results do not satisfy predefined validation criteria, the corresponding Stage Leader Agent selectively re-invokes relevant Subagents to refine the outcome, with explicit limits on the number of iterations.
Artifacts produced at each stage, including structured metadata, configuration files, and validation summaries, are recorded and propagated across stages by a Global Coordinator.
This allows subsequent stages to operate based on established information while maintaining a clear separation of responsibilities.
Together, these components form a structured workflow that enables scalable and reproducible benchmark construction for traffic prediction.
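As a rough sketch of this control pattern, each Stage Leader iterates its Subagents under an explicit iteration limit while the Global Coordinator chains the stages; the class and method names below are hypothetical, not the released implementation, which re-invokes only the relevant Subagents.
\begin{verbatim}
# Hypothetical sketch of the stage-coordination pattern.
MAX_ITERS = 3  # explicit limit on refinement iterations

class StageLeader:
    def __init__(self, subagents, validate):
        self.subagents, self.validate = subagents, validate

    def run(self, context):
        artifacts = {}
        for _ in range(MAX_ITERS):
            for agent in self.subagents:
                artifacts[agent.name] = agent.execute(context)
            ok, issues = self.validate(artifacts)
            if ok:  # validation criteria satisfied
                break
            context = {**context, "feedback": issues}
        return artifacts

class GlobalCoordinator:
    def __init__(self, stages):
        self.stages = stages  # retrieval, integration, eval

    def run(self, context):
        for stage in self.stages:  # propagate artifacts
            context = {**context, **stage.run(context)}
        return context
\end{verbatim}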
\begin{figure*}
\centering
\includegraphics[width=0.8\linewidth]{agentv2.pdf}
\caption{AgentCity framework overview.
Benchmark construction is organized into three stages: Literature Retrieval, Model and Data Integration, and Standardized Evaluation.
Each stage is coordinated by a Leader Agent that invokes specialized Subagents to perform stage-specific operations.
}
\label{fig:overview}
\end{figure*}
\subsection{Stage I: Literature Retrieval}
\label{sec:literature}
The Literature Retrieval stage collects research work related to a given traffic prediction task and produces a structured set of candidate models for downstream integration and evaluation.
This stage defines a documented search and filtering procedure and records the resulting candidates and associated metadata.
It is managed by a \emph{Retrieval Leader Agent}, which coordinates multiple Subagents to perform concrete operations.
\paratitle{Paper Searcher.}
The Paper Searcher retrieves candidate papers using keyword-based queries derived from user input or a predefined set of task-specific keywords.
Additional constraints, such as publication venues or time ranges, can be specified to delimit the search scope.
This step collects studies related to the target traffic prediction task across different modeling approaches.
\paratitle{Paper Evaluator.}
The Paper Evaluator examines each retrieved paper to determine whether it provides the information required for subsequent model and data integration.
The evaluation checks whether the paper specifies the prediction task, model formulation, input--output definitions, experimental setup, and evaluation metrics.
Papers that lack information required for model implementation, data preparation, or evaluation are excluded at this stage.
\paratitle{Paper Analyzer.}
For papers retained after evaluation, the Paper Analyzer extracts information needed for later stages.
This includes references to model architectures, code repositories, descriptions of datasets and preprocessing steps, training and evaluation settings, and reported metrics.
The extracted information is organized into a structured representation for use in model and data integration.
\paratitle{Stage execution.}
The Retrieval Leader Agent executes the search, evaluation, and analysis steps in sequence.
When the resulting paper set does not satisfy predefined criteria, such as coverage of the target task or completeness of extracted metadata, the leader reviews the execution outcomes and re-executes the relevant steps.
The output of this stage is a structured collection of candidate models and associated metadata, which is passed to the subsequent integration stage.
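At a high level, this stage behaves as a filter-and-extract pipeline over paper records; the sketch below uses hypothetical field names, whereas the actual Subagents are LLM-driven rather than rule-based.
\begin{verbatim}
# Illustrative Stage-I filtering and extraction.
REQUIRED = {"task", "model_formulation", "io_definition",
            "experimental_setup", "metrics"}

def evaluate_paper(paper: dict) -> bool:
    # Paper Evaluator: drop papers missing information
    # required for integration and evaluation.
    return REQUIRED.issubset(paper)

def analyze_paper(paper: dict) -> dict:
    # Paper Analyzer: structured metadata for Stage II.
    return {"title": paper.get("title"),
            "code_repo": paper.get("code_repo"),
            "datasets": paper.get("datasets", []),
            "reported_metrics": paper.get("metrics", {})}

def retrieve(candidates: list) -> list:
    # Search (upstream) -> evaluate -> analyze.
    return [analyze_paper(p) for p in candidates
            if evaluate_paper(p)]
\end{verbatim}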
\subsection{Stage II: Model and Data Integration}
\label{sec:migration}
The Model and Data Integration stage reproduces external traffic prediction models together with their associated datasets and aligns them with unified task interfaces for evaluation.
This stage transforms heterogeneous research implementations into executable benchmark components that follow consistent data organization, training procedures, and evaluation protocols.
It is coordinated by an \emph{Integration Leader Agent}, which manages a set of Subagents responsible for concrete integration steps.
\paratitle{Source Collector.}
The Source Collector retrieves the resources required for reproduction, including model implementations, configuration files, and dataset references extracted in Stage~I.
It analyzes the structure of the retrieved codebase to identify model definitions, training pipelines, data loading logic, and external dependencies.
The collected sources serve as the basis for subsequent integration.
\paratitle{Model and Data Adapter.}
The Model and Data Adapter performs the core integration work.
For models, it aligns architecture definitions, input--output formats, and training interfaces with the benchmark's task specifications.
For datasets, it handles dataset acquisition, preprocessing alignment, feature construction, and data split configuration according to the benchmark protocol.
\paratitle{Configuration Assembler.}
The Configuration Assembler constructs unified configuration files that combine model settings, dataset parameters, and training options.
Reported hyperparameters and experimental settings from the original paper are incorporated when available.
When details are unspecified, task-consistent defaults defined by the benchmark are applied.
The resulting configurations define a complete and executable evaluation setup.
\paratitle{Integration Validator.}
The Integration Validator executes a validation run using the assembled model and dataset configuration.
It verifies model initialization, data loading, and basic training execution, and records logs to assess integration completeness.
\paratitle{Stage execution.}
The Integration Leader Agent executes source collection, adaptation, configuration assembly, and validation in sequence, and re-invokes relevant Subagents when validation criteria are not satisfied.
The output of this stage is an executable model--dataset pair together with structured configurations and validation records, which are passed to the evaluation stage.
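To illustrate the kind of artifact this stage produces, a hedged example of an assembled configuration and a minimal validation run is given below; the keys and helper interfaces are illustrative, not the exact benchmark schema.
\begin{verbatim}
# Hedged example of an assembled configuration.
assembled_config = {
    "task": "traffic_state_pred",
    "model": "STAEformer",
    "dataset": "METR_LA",
    # benchmark defaults where the paper is silent:
    "data": {"input_window": 12, "output_window": 12},
    # reported hyperparameters where available:
    "training": {"learning_rate": 1e-3, "batch_size": 16},
    "evaluation": {"metrics": ["MAE", "RMSE"]},
}

def validate_integration(build, config):
    # Integration Validator sketch: verify model init,
    # data loading, and one basic training step.
    model, loader = build(config)
    loss = float(model.training_step(next(iter(loader))))
    return {"ok": loss == loss, "loss": loss}  # NaN check
\end{verbatim}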
\begin{figure*}
\centering
\includegraphics[width=1\linewidth]{pie_combined.png}
\caption{Distributions of studies included in the benchmark.
The figure shows the distribution of collected papers by publication venue (left), publication year (middle), and traffic prediction task (right).}
\label{fig:analysis}
\end{figure*}
\subsection{Stage III: Standardized Evaluation}
\label{sec:evaluation}
The Standardized Evaluation stage evaluates integrated traffic prediction models under unified training and evaluation protocols to produce comparable performance results across models.
It is coordinated by an \emph{Evaluation Leader Agent}, which oversees a small set of Subagents responsible for execution and result aggregation.
\paratitle{Evaluation Planner.}
The Evaluation Planner specifies the evaluation configuration for each model--task pair, including training settings, evaluation metrics, and the hyperparameter ranges defined by the benchmark protocol.
\paratitle{Evaluation Executor.}
The Evaluation Executor runs model training and evaluation using the specified configurations.
During execution, it records performance metrics, training dynamics, and runtime information required for result reporting and analysis.
\paratitle{Result Collector.}
The Result Collector aggregates evaluation outputs across runs, identifies the best-performing configurations according to task-specific metrics, and organizes the results into standardized records for benchmarking.
\paratitle{Stage execution.}
The Evaluation Leader Agent coordinates planning, execution, and result collection, and re-invokes relevant steps when evaluation results are invalid or incomplete.
The output of this stage is a set of standardized evaluation results that can be directly compared across models.
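The aggregation step can be sketched as selecting, for each model--task pair, the configuration that optimizes the task's primary metric; the record layout below is hypothetical.
\begin{verbatim}
# Sketch of Stage-III result aggregation.
def collect_results(runs, metric="MAE", lower=True):
    key = lambda r: r["metrics"][metric]
    best = min(runs, key=key) if lower else max(runs, key=key)
    return {"model": best["model"],
            "config": best["config"],    # best setup found
            "metrics": best["metrics"],  # leaderboard record
            "runtime": best.get("runtime")}
\end{verbatim}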
\subsection{Implementation Details}
\label{sec:implementation}
AgentCity is implemented as a coordinated multi-agent system centered around a \emph{Global Coordinator}.
The coordinator maintains a shared execution context and dispatches stage-specific \emph{Leader Agents} to execute the three benchmark stages in sequence.
Each Leader Agent manages its workflow by invoking Subagents, validating intermediate outputs, and controlling stage execution.
\paratitle{Agent Coordination and Control.}
Leader Agents follow a unified control pattern, decomposing each stage into executable steps, invoking Subagents for concrete operations, and collecting structured outputs.
Subagents encapsulate task-specific functions such as literature querying, source acquisition, code adaptation, dataset preparation, model execution, and result aggregation.
\paratitle{Cross-Stage Context Propagation.}
The Global Coordinator maintains a shared execution context that records structured artifacts produced at each stage.
These artifacts are propagated across stages to support subsequent execution without repeating earlier steps.
\paratitle{Model Backend Configuration.}
Different language model backends can be assigned to agents according to task requirements.
Code-related and diagnostic tasks use more capable backends, while routine operations may use lighter-weight ones.
Backend selection is specified through system configuration and is independent of the overall workflow structure.
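A minimal configuration sketch shows how backend assignment stays orthogonal to the workflow; the backend names are placeholders, not the exact models used in our experiments.
\begin{verbatim}
# Placeholder backend assignment per agent role.
AGENT_BACKENDS = {
    "model_adapter":    "strong-llm",  # code adaptation
    "validator_debug":  "strong-llm",  # diagnostics
    "paper_searcher":   "light-llm",   # routine querying
    "result_collector": "light-llm",
}

def backend_for(agent, default="light-llm"):
    return AGENT_BACKENDS.get(agent, default)
\end{verbatim}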
\begin{table}[t]
\centering
\caption{Traffic prediction datasets in AgentCity.}
\label{tab:dataset_stats}
\resizebox{\linewidth}{!}{
\begin{tabular}{l p{7cm}}
\toprule
\textbf{Task} & \textbf{Dataset} \\
\midrule
Traffic State Prediction & METR-LA\cite{METR_LA/PEMS_BAY}, PEMSD7(M)\cite{PEMSD7M}, PEMS-BAY\cite{METR_LA/PEMS_BAY}, PEMSD3\cite{PEMSD3/7}, PEMSD4\cite{PEMSD4/8}, PEMSD7\cite{PEMSD3/7}, PEMSD8\cite{PEMSD4/8}, TAXIBJ\cite{TaxiBJ}, T-DRIVE\cite{T-drive}, NYCTaxi\cite{NYCTaxi/Bike}, NYCBike\cite{NYCTaxi/Bike}, LargeST\cite{LargeST} \\
Traj. Loc. Prediction & Gowalla\cite{Gowalla/BrightKite}, Foursquare-TKY\cite{Foursquare-NYC/TKY}, Foursquare-NYC\cite{Foursquare-NYC/TKY}, BrightKite\cite{Gowalla/BrightKite}, Instagram\cite{Instagram}, Singapore\cite{Singapore}, Porto\cite{Porto} \\
ETA Prediction & Chengdu\cite{Chengdu/DeepTTE}, Beijing\cite{Beijing/TTPNet}, Porto\cite{Porto}, NYCTaxi\cite{NYCTaxi/Bike}, NYCBike\cite{NYCTaxi/Bike} \\
Map Matching & Global\cite{Global} (Neftekamsk, Ruzhany, Spaichingen, Valky), Seattle\cite{Seattle} \\
\bottomrule
\end{tabular}}
\end{table}
\begin{table}[t]
\centering
\caption{Traffic prediction models in AgentCity.}
\label{tab:model_stats}
\resizebox{\linewidth}{!}{
\begin{tabular}{l p{9cm}}
\toprule
\textbf{Task} & \textbf{Model} \\
\midrule
Traffic State Prediction &
STSSDL\cite{STSSDL}, STAEformer\cite{STAEformer}, AutoSTF\cite{AutoSTF}, STDMAE\cite{STDMAE},
EAC\cite{EAC}, GriddedTNP\cite{GriddedTNP}, PatchSTG\cite{PatchSTG}, SRSNet\cite{SRSNet},
FlashST\cite{FlashST}, ConvTimeNet\cite{Convtimenet}, Fredformer\cite{Fredformer}, Pathformer\cite{Pathformer},
HTVGNN\cite{HTVGNN}, PatchTST\cite{PatchTST}, DCST\cite{DCST}, STLLM\cite{STLLM},
T-graphormer\cite{T-graphormer}, CKGGNN\cite{CKGGNN}, EasyST\cite{EasyST}, LEAF\cite{LEAF},
MetaDG\cite{MetaDG}, TRACK\cite{TRACK}, HiMSNet\cite{HiMSNet}, DST2former\cite{DST2former},
DSTMamba\cite{DSTMamba}, BigST\cite{BigST}, ASeer\cite{ASeer}, STHSepNet\cite{STHSepNet},
STWave\cite{STWave}, HSTWAVE\cite{HSTWAVE}, DSTAGNN\cite{DSTAGNN}, RSTIB\cite{RSTIB},
LSTTN\cite{LSTTN}, LightST\cite{LightST}, TimeMixer++\cite{TimeMixer++}, STID\cite{STID}, UniST\cite{UniST} \\
Traj. Loc. Prediction &
DeepMove\cite{DeepMove}, PLMTrajRec\cite{PLMTrajRec}, START\cite{START}, LoTNext\cite{LoTNext},
RNTrajRec\cite{RNTrajRec}, CoMaPOI\cite{CoMaPOI}, JGRM\cite{JGRM}, TrajSDE\cite{TrajSDE},
DCHL\cite{DCHL}, GNPRSID\cite{GNPRSID}, PLSPL\cite{PLSPL}, GETNext\cite{GETNEXT},
CANOE\cite{CANOE}, TPG\cite{TPG}, CLSPRec\cite{CLSPRec}, AGRAN\cite{AGRAN},
LightPath\cite{LightPath}, ROTAN\cite{ROTAN}, FPMC\cite{FPMC}, PRME\cite{PRME} \\
ETA Prediction &
DOT\cite{DOT}, MetaTTE\cite{MetaTTE}, MVSTM\cite{MVSTM}, DutyTTE\cite{DutyTTE},
TTPNet\cite{TTPNet}, MTSTAN\cite{MTSTAN}, MulT-TTE\cite{MulT-TTE}, MDTI\cite{MDTI},
ProbETA\cite{ProbETA}, HierETA\cite{HierETA}, HetETA\cite{HetETA} \\
Map Matching &
DeepMM\cite{DeepMM}, GraphMM\cite{GraphMM}, DiffMM\cite{DiffMM}, TRMMA\cite{TRMMA},
L2MM\cite{L2MM}, RLOMM\cite{RLOMM}, FMM\cite{FMM}, HMMM\cite{HMMM}, STMatching\cite{STMatching} \\
\bottomrule
\end{tabular}}
\end{table}
\begin{table*}[t]
\centering
\caption{Task-wise datasets, data scale statistics, and evaluation metrics used in the benchmark.
$N$, $E$, and $U$ denote the numbers of nodes, edges, and users, respectively.
$T$ denotes the total volume of data records, corresponding to the accumulated traffic flow observations for Traffic State Prediction and the total number of trajectory points or check-ins for the other tasks.}
\label{tab:task_dataset_overview}
\resizebox{\linewidth}{!}{
\begin{tabular}{l l l l l}
\toprule
\textbf{Task} & \textbf{Dataset} & \textbf{Scale ($N/E/U/T$)}
& \textbf{Time Span} & \textbf{Metrics} \\
\midrule
\multirow{3}{*}{Traffic State Prediction}
& METR-LA
& $N{=}207$, $E{=}11{,}753$, $T{=}7.1$M
& Mar. 2012 -- Jun. 2012
& MAE$\downarrow$, RMSE$\downarrow$ \\
& PEMSD7
& $N{=}228$, $E{=}51{,}984$, $T{=}2.9$M
& May 2017 -- Aug. 2017
& MAE$\downarrow$, RMSE$\downarrow$ \\
& PEMS-BAY
& $N{=}325$, $E{=}8{,}358$, $T{=}16.9$M
& Jan. 2017 -- Jun. 2017
& MAE$\downarrow$, RMSE$\downarrow$ \\
\midrule
\multirow{3}{*}{Trajectory Location Prediction}
& Foursquare-NYC
& $N{=}38{,}332$, $U{=}1{,}082$, $T{=}227$K
& Apr. 2012 -- Feb. 2013
& Acc@1$\uparrow$, Acc@5$\uparrow$ \\
& Foursquare-TKY
& $N{=}61{,}857$, $U{=}2{,}292$, $T{=}574$K
& Apr. 2012 -- Feb. 2013
& Acc@1$\uparrow$, Acc@5$\uparrow$ \\
& Singapore
& $N{=}20{,}153$, $U{=}17{,}744$, $T{=}696$K
& Jan. 2017 -- Jun. 2017
& Acc@1$\uparrow$, Acc@5$\uparrow$ \\
\midrule
\multirow{2}{*}{ETA Prediction}
& Beijing
& $N{=}16{,}383$, $U{=}76$, $T{=}518$K
& Oct. 2013
& MAE$\downarrow$, MAPE$\downarrow$, RMSE$\downarrow$ \\
& Chengdu
& $N{=}440{,}056$, $U{=}4{,}565$, $T{=}712$K
& Aug. 2014
& MAE$\downarrow$, MAPE$\downarrow$, RMSE$\downarrow$ \\
\midrule
\multirow{5}{*}{Map Matching}
& Neftekamsk
& $N{=}18{,}195$, $E{=}41{,}971$, $T{=}2.5$K
& 2015
& RMF$\downarrow$, AL$\uparrow$ \\
& Santander
& $N{=}24{,}217$, $E{=}48{,}100$, $T{=}653$
& 2015
& RMF$\downarrow$, AL$\uparrow$ \\
& Spaichingen
& $N{=}4{,}575$, $E{=}9{,}992$, $T{=}517$
& 2015
& RMF$\downarrow$, AL$\uparrow$ \\
& Valky
& $N{=}1{,}578$, $E{=}3{,}142$, $T{=}1.0$K
& 2015
& RMF$\downarrow$, AL$\uparrow$ \\
\bottomrule
\end{tabular}}
\end{table*}
\begin{table}[t]
\centering
\caption{Traffic state prediction leaderboard on METR-LA, PEMSD7, and PEMS-BAY under unified evaluation protocols.}
\label{tab:traffic_leaderboard}
\resizebox{1\linewidth}{!}{
\begin{tabular}{l cc cc cc}
\toprule
\textbf{Model} &
\multicolumn{2}{c}{\textbf{METR-LA}} &
\multicolumn{2}{c}{\textbf{PEMSD7}} &
\multicolumn{2}{c}{\textbf{PEMS-BAY}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& MAE$\downarrow$ & RMSE$\downarrow$
& MAE$\downarrow$ & RMSE$\downarrow$
& MAE$\downarrow$ & RMSE$\downarrow$ \\
\midrule
STAEformer\cite{STAEformer} & 2.962 & 5.984 & 18.96 & 32.28 & 1.532 & 3.446 \\
DCST\cite{DCST} & 3.090 & 6.334 & 19.39 & 32.72 & 1.561 & 3.483 \\
DST2former\cite{DST2former} & 3.095 & 6.240 & 19.67 & 32.61 & 1.639 & 3.587 \\
STDMAE\cite{STDMAE} & 3.096 & 6.230 & 20.19 & 32.99 & 1.579 & 3.502 \\
EasyST\cite{EasyST} & 3.115 & 6.419 & 19.49 & 32.48 & 1.565 & 3.509 \\
PatchSTG\cite{PatchSTG} & 3.127 & 6.316 & 19.99 & 32.90 & 1.589 & 3.580 \\
HiMSNet\cite{HiMSNet} & 3.143 & 6.221 & 23.34 & 36.04 & 1.670 & 3.613 \\
STLLM\cite{STLLM} & 3.151 & 6.284 & 20.92 & 33.65 & 1.616 & 3.592 \\
LightST\cite{LightST} & 3.167 & 6.372 & 22.00 & 34.59 & 1.607 & 3.580 \\
STWave\cite{STWave} & 3.186 & 6.417 & 23.02 & 37.04 & 1.619 & 3.621 \\
RSTIB\cite{RSTIB} & 3.194 & 6.606 & 20.37 & 33.40 & 1.610 & 3.666 \\
FlashST\cite{FlashST} & 3.203 & 6.511 & 22.40 & 35.47 & 1.636 & 3.645 \\
BigST\cite{BigST} & 3.218 & 6.359 & 21.11 & 34.18 & 1.622 & 3.538 \\
TRACK\cite{TRACK} & 3.278 & 6.710 & 25.82 & 39.31 & 1.749 & 4.007 \\
DSTAGNN\cite{DSTAGNN} & 3.331 & 6.599 & 22.73 & 36.04 & 1.745 & 3.800 \\
GriddedTNP\cite{GriddedTNP} & 3.412 & 6.989 & 29.83 & 53.10 & 2.379 & 5.099 \\
EAC\cite{EAC} & 3.532 & 6.915 & 26.61 & 40.23 & 1.834 & 4.045 \\
AutoSTF\cite{AutoSTF} & 3.977 & 9.406 & 19.72 & 32.56 & 1.544 & 3.446 \\
Fredformer\cite{Fredformer} & 4.159 & 9.014 & 24.16 & 38.54 & 1.866 & 4.214 \\
ConvTimeNet\cite{Convtimenet} & 4.250 & 9.249 & 29.18 & 45.33 & 2.014 & 4.650 \\
LEAF\cite{LEAF} & 4.407 & 9.989 & 28.49 & 43.17 & 1.886 & 4.101 \\
SRSNet\cite{SRSNet} & 4.882 & 10.348& 32.12 & 48.80 & 2.163 & 4.923 \\
\bottomrule
\end{tabular}}
\end{table}
\begin{table}[t]
\centering
\caption{Trajectory location prediction leaderboard on Foursquare-NYC, Foursquare-TKY, and Singapore.}
\label{tab:traj_leaderboard}
\resizebox{1\linewidth}{!}{
\begin{tabular}{l cc cc cc}
\toprule
\textbf{Model} &
\multicolumn{2}{c}{\textbf{Foursquare-NYC}} &
\multicolumn{2}{c}{\textbf{Foursquare-TKY}} &
\multicolumn{2}{c}{\textbf{Singapore}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& Acc@1$\uparrow$ & Acc@5$\uparrow$
& Acc@1$\uparrow$ & Acc@5$\uparrow$
& Acc@1$\uparrow$ & Acc@5$\uparrow$ \\
\midrule
ROTAN\cite{ROTAN} & 0.1302 & 0.2805 & 0.1897 & 0.3653 & 0.1631 & 0.3331 \\
GNPRSID\cite{GNPRSID} & 0.1591 & 0.3419 & 0.1658 & 0.3746 & 0.1539 & 0.3471 \\
RNTrajRec\cite{RNTrajRec} & 0.1605 & 0.3231 & 0.1539 & 0.3305 & 0.1378 & 0.2978 \\
DeepMove\cite{DeepMove} & 0.1572 & 0.3739 & 0.1800 & 0.3869 & 0.1298 & 0.3096 \\
PLSPL\cite{PLSPL} & 0.1034 & 0.3211 & 0.1732 & 0.3596 & 0.1527 & 0.3294 \\
CANOE\cite{CANOE} & 0.1147 & 0.2883 & 0.1535 & 0.3485 & 0.1366 & 0.3089 \\
LoTNext\cite{LoTNext} & 0.0856 & 0.2402 & 0.1322 & 0.3890 & 0.1365 & 0.3576 \\
DCHL\cite{DCHL} & 0.1009 & 0.3141 & 0.0706 & 0.2507 & 0.0889 & 0.2678 \\
\bottomrule
\end{tabular}}
\end{table}
\begin{table}[t]
\centering
\caption{ETA prediction leaderboard on Beijing and Chengdu.}
\label{tab:eta_leaderboard}
\resizebox{\linewidth}{!}{
\begin{tabular}{l ccc ccc}
\toprule
\multirow{2}{*}{\textbf{Model}} &
\multicolumn{3}{c}{\textbf{Beijing}} &
\multicolumn{3}{c}{\textbf{Chengdu}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
& MAE$\downarrow$ & MAPE$\downarrow$ & RMSE$\downarrow$
& MAE$\downarrow$ & MAPE$\downarrow$ & RMSE$\downarrow$ \\
\midrule
HetETA\cite{HetETA} & 125.67 & 0.105 & 222.91 & 190.56 & 0.113 & 308.56 \\
DeepTTE\cite{Chengdu/DeepTTE} & 224.46 & 0.208 & 351.74 & 317.38 & 0.220 & 429.09 \\
MVSTM\cite{MVSTM} & 279.08 & 0.270 & 430.98 & 255.18 & 0.189 & 343.43 \\
MulT-TTE\cite{MulT-TTE} & 280.36 & 0.274 & 432.43 & 465.59 & 0.381 & 580.25 \\
DOT\cite{DOT} & 364.85 & 0.382 & 547.62 & 209.74 & 0.163 & 286.02 \\
MetaTTE\cite{MetaTTE} & 372.15 & 0.347 & 562.24 & 394.52 & 0.300 & 511.63 \\
DutyTTE\cite{DutyTTE} & 431.59 & 0.460 & 572.96 & 243.13 & 0.171 & 443.44 \\
\bottomrule
\end{tabular}}
\end{table}
\begin{table}[t]
\centering
\caption{Map matching leaderboard on Santander, Spaichingen, Neftekamsk, and Valky.}
\label{tab:mm_leaderboard}
\resizebox{0.9\linewidth}{!}{
\begin{tabular}{l cc cc cc cc}
\toprule
\multirow{2}{*}{\textbf{Model}} &
\multicolumn{2}{c}{\textbf{Santander}} &
\multicolumn{2}{c}{\textbf{Spaichingen}} &
\multicolumn{2}{c}{\textbf{Neftekamsk}} &
\multicolumn{2}{c}{\textbf{Valky}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
& RMF$\downarrow$ & AL$\uparrow$
& RMF$\downarrow$ & AL$\uparrow$
& RMF$\downarrow$ & AL$\uparrow$
& RMF$\downarrow$ & AL$\uparrow$ \\
\midrule
FMM\cite{FMM} & 0.018 & 1.000 & 0.000 & 1.000 & 0.852 & 0.193 & 0.329 & 0.671 \\
HMMM\cite{HMMM} & 0.021 & 0.997 & 0.035 & 1.000 & 0.391 & 0.999 & 0.433 & 1.000 \\
STMatching\cite{STMatching} & 0.674 & 0.998 & 0.088 & 1.000 & 0.457 & 1.000 & 0.436 & 1.000 \\
DeepMM\cite{DeepMM} & 0.981 & 0.019 & 0.947 & 0.053 & 0.889 & 0.111 & 0.909 & 0.091 \\
L2MM\cite{L2MM} & 1.132 & 0.057 & 1.632 & 0.158 & 0.778 & 0.222 & 2.455 & 0.182 \\
RLOMM\cite{RLOMM} & 0.920 & 0.280 & 2.760 & 0.240 & 7.440 & 0.120 & 3.000 & 0.600 \\
\bottomrule
\end{tabular}}
\end{table}
\section{The AgentCity Benchmark}
\label{sec:benchmark_release}
\subsection{Benchmark Scope and Coverage}
\label{sec:benchmark_scope}
AgentCity supports a unified benchmark that spans multiple traffic prediction tasks and datasets.
At the time of writing, the benchmark covers four representative traffic prediction tasks, including traffic state prediction, trajectory location prediction, ETA prediction, and map matching.
Across these tasks, AgentCity aggregates a diverse collection of publicly available datasets and model implementations.
Table~\ref{tab:dataset_stats} summarizes the datasets included in the benchmark.
In total, AgentCity covers 26 publicly available datasets across the four traffic prediction tasks.
These datasets span heterogeneous spatial representations and temporal resolutions, including graph-based, grid-based, and origin--destination data for traffic state prediction, as well as trajectory datasets represented as variable-length sequences of locations or GPS points.
For ETA prediction and map matching, the benchmark includes GPS trajectory datasets with different scales in terms of trajectory volume and network size.
Table~\ref{tab:model_stats} summarizes the traffic prediction models currently included in AgentCity.
For each task, the benchmark integrates a representative set of models that follow heterogeneous modeling assumptions and architectural designs.
All models are reproduced and evaluated under unified task definitions and evaluation protocols, enabling consistent comparison within and across tasks.
Across tasks, the benchmark includes datasets defined on sensor networks, region-based spatial partitions, road network graphs, and individual trajectories.
Traffic state prediction datasets are typically defined on fixed sensor networks with regular temporal sampling, while trajectory-based datasets represent individual mobility as sequences of locations or GPS points.
Map matching datasets are constructed on explicit road networks and focus on network-constrained trajectory inference.
Together, these datasets capture both group-level and individual-level traffic dynamics under heterogeneous spatial settings.
\subsection{Literature Coverage Analysis}
\label{sec:literature_analysis}
To characterize the literature coverage of the benchmark, we analyze the distribution of studies included through AgentCity across publication venues, years, and traffic prediction tasks.
Figure~\ref{fig:analysis} summarizes these statistics based on the models that have been reproduced and integrated into the benchmark.
In total, the benchmark includes 74 research papers published in recent years.
These papers span multiple traffic prediction tasks, with 36 studies on traffic state prediction, 18 on trajectory location prediction, 11 on estimated time of arrival (ETA) prediction, and 9 on map matching.
This task distribution reflects the relative research activity across different traffic prediction problems.
The venue distribution indicates that many collected studies originate from major data mining and machine learning venues, with KDD representing the largest share.
In addition, a notable portion of the models were released through arXiv, reflecting research activity beyond traditional conference venues.
The year distribution indicates that most included studies were published between 2023 and 2025.
This concentration reflects the recent growth of research activity in traffic prediction and related areas.
These statistics provide a descriptive overview of the literature represented in the benchmark and clarify the scope of models evaluated in AgentCity.
\subsection{Task-wise Leaderboards}
\label{sec:leaderboards}
This subsection presents representative leaderboard results for four core traffic prediction tasks under unified evaluation protocols.
The reported results provide a task-wise view of model performance under consistent data processing, training, and evaluation settings.
Traffic state prediction results are reported on METR-LA, PEMSD7, and PEMS-BAY; trajectory location prediction on Foursquare (NYC, TKY) and Singapore; ETA prediction on Beijing and Chengdu; and map matching on selected cities from the Global dataset.
All models are evaluated within a unified framework, with hyperparameters systematically tuned via AgentCity.
Training is controlled using early stopping based on validation loss, and the checkpoint with the best validation performance is selected for evaluation.
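This training control corresponds to a standard early-stopping loop, sketched below; the patience value and model interface are assumptions following common PyTorch conventions, not the benchmark's exact settings.
\begin{verbatim}
# Illustrative early-stopping loop with best-checkpoint
# selection on validation loss.
def train(model, train_epoch, validate,
          patience=10, max_epochs=100):
    best, state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        train_epoch(model)
        val = validate(model)  # validation loss
        if val < best:
            best, state, stale = val, model.state_dict(), 0
        else:
            stale += 1
            if stale >= patience:  # no recent improvement
                break
    model.load_state_dict(state)  # restore best checkpoint
    return model, best
\end{verbatim}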
Table~\ref{tab:task_dataset_overview} summarizes the datasets used in the reported benchmark results, together with their basic statistics and evaluation protocols.
Tables~\ref{tab:traffic_leaderboard}--\ref{tab:mm_leaderboard} present the corresponding task-wise leaderboard results under consistent evaluation settings.
For clarity and space considerations, we report results on a representative subset of widely used datasets and models for each task, following standard evaluation settings in prior studies.
The complete benchmark results, covering additional datasets and model implementations, are available through the online leaderboard.
\begin{figure}
\centering
\begin{subfigure}[b]{0.48\linewidth}
\hspace{-3px}
\includegraphics[width=\linewidth]{figures/Frontend.png}
\caption{Benchmark Homepage}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\linewidth}
\hspace{-3px}
\includegraphics[width=\linewidth]{figures/LeaderBoard.png}
\caption{AgentCity Interface}
\end{subfigure}
\caption{The AgentCity platform.
The benchmark homepage presents benchmark statistics and public leaderboards.
The AgentCity interface provides an interactive environment for the agent-driven workflow.}
\label{fig:AgentCity}
\end{figure}
\subsection{Benchmark Access and Usage}
\label{sec:benchmark_access}
The AgentCity benchmark is publicly accessible.
Figure~\ref{fig:AgentCity} presents the project homepage and the AgentCity user interface, which together provide benchmark information, evaluation results, and guidance for executing the benchmark workflow with AgentCity.
The project homepage introduces the overall scope of AgentCity, including the supported traffic prediction tasks, benchmark organization, and evaluation protocols.
It provides documentation for installing and running AgentCity and presents detailed task-wise leaderboards that report benchmark results under unified evaluation settings.
The AgentCity user interface allows users to interactively execute the benchmark construction workflow described in this paper.
Through the interface, users can run the three stages of literature retrieval, model and data integration, and standardized evaluation, and examine the corresponding outputs.
Execution logs, intermediate artifacts, and analysis results from each stage are displayed to support inspection of the benchmark process.
Detailed usage instructions, task-wise leaderboards, and documentation of the unified evaluation framework are available through the project website and source code repository.\footnote{\fulllink}
\begin{table}[t]
\centering
\caption{Comparison between reported results and reproduced results in terms of MAE and RMSE.}
\label{tab:mae_rmse_comparison}
\resizebox{\linewidth}{!}{%
\begin{tabular}{l l cc cc c}
\toprule
\multirow{2}{*}{\textbf{Model}} & \multirow{2}{*}{\textbf{Dataset}} &
\multicolumn{2}{c}{\textbf{Paper Reported}} &
\multicolumn{2}{c}{\textbf{Reproduced}} &
\multirow{2}{*}{\textbf{Gap (\%)}} \\
\cmidrule(lr){3-4} \cmidrule(lr){5-6}
& & MAE & RMSE & MAE & RMSE & \\
\midrule
DSTAGNN & PEMSD4 & 19.30 & 31.46 & 19.90 & 31.29 & 0.85 \\
LightST & PEMSD7 & 20.78 & 33.95 & 21.99 & 34.59 & 3.38 \\
RSTIB & PEMSD7 & 19.84 & 33.90 & 20.37 & 33.40 & 0.06 \\
STDMAE & METR-LA & 3.00 & 5.98 & 3.09 & 6.23 & 3.79 \\
LSTTN & METR-LA & 2.96 & 5.92 & 3.08 & 6.12 & 3.60 \\
AutoSTF & PEMS-BAY & 1.55 & 3.51 & 1.54 & 3.44 & -1.58 \\
DCST & PEMS-BAY & 1.55 & 3.50 & 1.56 & 3.48 & -0.20 \\
\bottomrule
\end{tabular}%
}
\end{table}
\begin{table*}[t]
\centering
\caption{Comparison of reproduction consistency between AgentCity and other code-oriented agents.}
\label{tab:selected_models}
\resizebox{0.8\linewidth}{!}{
\begin{tabular}{l ccc ccc ccc ccc}
\toprule
\multirow{2}{*}{\textbf{Source}} &
\multicolumn{3}{c}{\textbf{STDMAE (PEMSD7)}} &
\multicolumn{3}{c}{\textbf{LightST (PEMSD7)}} &
\multicolumn{3}{c}{\textbf{LSTTN (METR-LA)}} &
\multicolumn{3}{c}{\textbf{DSTAGNN (PEMSD4)}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7} \cmidrule(lr){8-10} \cmidrule(lr){11-13}
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$ \\
\midrule
Reported~(Paper) &
18.65 & 31.44 & 0.00 &
20.78 & 33.95 & 0.00 &
2.96 & 5.92 & 0.00 &
19.30 & 31.46 & 0.00 \\
SWE-agent &
31.96 & 45.87 & 55.38 &
22.21 & 34.76 & 4.09 &
4.50 & 9.84 & 61.49 &
20.11 & 31.48 & 1.64 \\
OpenHands &
21.79 & 34.55 & 12.48 &
26.18 & 38.89 & 18.89 &
6.55 & 11.80 & 106.64 &
20.27 & 31.97 & 2.91 \\
\textbf{AgentCity} &
\textbf{20.19} & \textbf{32.99} & \textbf{6.17} &
\textbf{21.99} & \textbf{34.59} & \textbf{3.38} &
\textbf{3.08} & \textbf{6.12} & \textbf{3.60} &
\textbf{19.90} & \textbf{31.29} & \textbf{0.85} \\
\bottomrule
\end{tabular}}
\end{table*}
\section{Benchmark Validation}
\label{sec:validation}
\subsection{Reproduction Fidelity}
\label{sec:fidelity}
We evaluate the reproduction fidelity of AgentCity by comparing reproduced results with the metrics reported in the original papers.
This analysis examines whether AgentCity reproduces results that are consistent with those reported in prior studies.
We focus on the traffic state prediction task, which has well-established datasets and evaluation protocols and is commonly used in the literature.
Seven representative models are selected for analysis, covering different architectural designs and training strategies.
For each model--dataset pair, we report the MAE and RMSE values stated in the original paper together with the corresponding results reproduced by AgentCity.
The relative gap between reported and reproduced results is summarized in Table~\ref{tab:mae_rmse_comparison}.
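Here, Gap denotes the relative deviation of the summed reproduced errors from the summed reported errors,
\[
\text{Gap} = \frac{(\widehat{\text{MAE}} + \widehat{\text{RMSE}}) - (\text{MAE} + \text{RMSE})}{\text{MAE} + \text{RMSE}} \times 100\%,
\]
where hats denote reproduced values; negative gaps indicate reproduced results that improve on the reported ones. The same definition underlies the Gap\% columns in Table~\ref{tab:selected_models}.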
Across the examined models and datasets, the reproduced results are generally close to the reported values.
Differences between reproduced results and reported values can arise from software and hardware environments, nondeterministic training behavior, and minor implementation variations.
All results are obtained using a consistent reproduction and evaluation process without manual intervention, indicating that AgentCity reproduces published traffic prediction models with reasonable fidelity.
\subsection{Reproduction Consistency Across Code Agents}
\label{sec:agent_comparison}
We compare the reproduction results obtained by AgentCity with those produced by two general-purpose code-oriented agents, SWE-agent~\cite{Swe-agent} and OpenHands~\cite{OpenHands}.
The comparison examines reproduction consistency, defined as how closely reproduced results match the metrics reported in the original papers.
All agents are evaluated under the same reproduction setting, using Claude-4.5-Opus as the underlying language model, operating on the same code repositories and datasets, and following the same reproduction objective of matching reported MAE and RMSE values.
The prompts used to specify reproduction tasks are identical across agents and are described in Appendix~\ref{Model Adapter}.
Each agent is allowed to iteratively execute, debug, and rerun code until a valid training and evaluation pipeline is completed.
No manual intervention or task-specific adjustment is performed for any agent during the reproduction process.
Table~\ref{tab:selected_models} summarizes the reproduction results.
For each model--dataset pair, the table reports the metrics stated in the original paper together with the reproduced MAE, RMSE, and relative gaps.
Across the evaluated cases, AgentCity produces reproduced results that are closer to the reported values than those obtained by the other agents under the same reproduction setting.
\section{Related Work}
\subsection{Traffic Prediction Benchmarks}
Benchmark research in traffic prediction has progressed from unified deep learning toolkits toward more diverse evaluation settings.
Early benchmarks such as LibCity~\cite{Libcity}, DL-Traff~\cite{Dl-traff}, and TorchSpatial~\cite{Torchspatial} focus on standardizing data processing, task definitions, and evaluation protocols for traffic prediction models, providing a common basis for reproducible comparison of predictive performance.
More recent efforts, including CityBench~\cite{CityBench}, STBench~\cite{STBench}, and USTBench~\cite{USTBench}, extend benchmarking beyond predictive accuracy to assess semantic understanding, reasoning, and planning capabilities of general-purpose models in urban and transportation scenarios.
Despite this progress, most existing traffic prediction benchmarks are constructed and maintained through largely manual processes.
The automation and continuous maintenance of the benchmarking workflow remain insufficiently addressed.
\subsection{LLM Agents for Automated Reproduction and Benchmarking}
Recent advances in large language model (LLM) agents have enabled tighter coupling between natural language reasoning and automated code generation in scientific workflows.
General-purpose frameworks such as SWE-agent~\cite{Swe-agent} and OpenHands~\cite{OpenHands} demonstrate the ability to navigate and modify complex code repositories, while more specialized systems, including ML-Master~\cite{ML-Master} and PiML~\cite{PiML}, focus on automating and optimizing machine learning pipelines.
Building on these capabilities, research-oriented agents such as DeepCode~\cite{DeepCode}, Paper2Code~\cite{Paper2code}, and Agent Laboratory~\cite{Agentlaboratory} aim to support broader stages of the scientific process, ranging from algorithm understanding to experiment execution and reproduction~\cite{Autoreproduce}.
Despite this progress, most existing LLM-based agents are designed for general-purpose code interaction and research automation.
Their workflows do not explicitly account for the domain-specific requirements of traffic and spatiotemporal reproduction, such as heterogeneous data organization, task-specific preprocessing pipelines, and structured spatial representations.
\section{Conclusion}
In this work, we present AgentCity, an AI-maintained framework for the continuous construction and evaluation of traffic prediction benchmarks.
AgentCity formulates benchmark maintenance as a structured, agent-driven workflow that automates literature retrieval, model and data integration, and standardized evaluation under unified protocols, including systematic hyperparameter tuning. This allows benchmark construction to be treated as an ongoing, scalable process rather than a one-time manual effort.
Built on this framework, we release a publicly accessible traffic prediction benchmark that spans multiple representative tasks, integrates diverse datasets and model implementations, and provides task-wise leaderboards under consistent evaluation settings.
We further validate the reliability of the framework by comparing reproduced results with those reported in original papers and with results obtained by general-purpose code-oriented agents under the same reproduction settings, demonstrating stable and consistent reproduction performance.
AgentCity enables continuous and scalable maintenance of traffic prediction benchmarks under unified evaluation protocols, providing a reproducible basis for integrating and evaluating models as the benchmark evolves.