\title[AgentCity: An AI-Maintained Continuous Benchmark for Traffic Prediction]{AgentCity: An AI-Maintained Continuous Benchmark \\ for Traffic Prediction}

\input{misc}

\section{Introduction}

\begin{figure}[t]
\centering
\includegraphics[width=1.\linewidth]{assets/Agent_Promo_NG.png}
\caption{AgentCity: A multi-agent system for continuous traffic prediction benchmarking.}
\label{fig:placeholder}
\end{figure}

Traffic prediction is a fundamental component of data-driven intelligent transportation systems, supporting a wide range of applications such as traffic management, route planning, mobility analysis, and urban decision-making.
In recent years, advances in deep learning have led to rapid growth in traffic prediction models, covering diverse tasks including traffic state prediction, trajectory forecasting, travel time estimation, and map matching. These models vary substantially in architectural design, modeling assumptions, data requirements, and evaluation settings.

Consequently, benchmarks play a critical role in enabling systematic evaluation. By providing standardized datasets, clearly defined tasks, and consistent evaluation protocols, benchmarks allow fair and reproducible comparison of model performance across studies, and support empirical analysis within the research community. To this end, several benchmarking frameworks have been proposed for traffic and spatiotemporal prediction. Representative examples include DL-Traff~\cite{Dl-traff}, LibCity~\cite{Libcity}, and TorchSpatial~\cite{Torchspatial}, which aim to standardize data preprocessing, task definitions, and evaluation pipelines across a range of prediction tasks. These efforts establish a more consistent basis for empirical comparison.

However, existing benchmarks share a fundamental limitation: they rely on \textbf{manual, human-centered maintenance}, which introduces several structural challenges.

First, \textbf{\emph{limited scalability}} constrains benchmark coverage. The traffic prediction literature continues to expand rapidly, with a large number of new models published each year. These models are implemented using diverse frameworks, code structures, and data interfaces, making their manual integration into a unified benchmark labor-intensive and difficult to sustain at scale. As a result, benchmark coverage often lags behind recent research progress.

Second, \textbf{\emph{static evaluation pipelines}} limit continuous assessment. Most existing benchmarks are built upon fixed datasets and evaluation procedures, whereas real-world transportation systems evolve continuously, with changes in road networks, travel demand, and mobility patterns. Although some datasets are periodically updated, incorporating these updates into existing benchmarks typically requires additional manual effort, limiting long-term and continuous evaluation.

Third, \textbf{\emph{inconsistent evaluation settings}} weaken result comparability. Results reported in original papers are often obtained under carefully tuned configurations tailored to specific datasets and tasks, while benchmark implementations typically rely on default or minimally tuned settings. This difference can lead to deviations from reported results and reduces the benchmark’s reliability as a fair reference for model assessment.

Together, these challenges indicate that a key limitation of traffic prediction benchmarking is no longer the absence of standardized frameworks, but the lack of a \emph{continuous}, \emph{scalable}, and \emph{consistently evaluated} maintenance mechanism that treats benchmark construction as an ongoing process rather than a one-time effort.

In this work, we propose \textbf{AgentCity}, an \textbf{AI-maintained} framework for the continuous construction and evaluation of traffic prediction benchmarks.
AgentCity replaces manual, human-centered benchmark maintenance with an automated pipeline that systematically retrieves recent literature, integrates external model and dataset implementations, and evaluates models under unified and consistent protocols.

AgentCity structures benchmark maintenance as a coordinated workflow consisting of three core components: \emph{literature retrieval}, \emph{model and data integration}, and \emph{standardized evaluation}.
These components respectively support the automated discovery of relevant studies, the reproduction and integration of external models and datasets into a unified evaluation framework, and the fair assessment of models under consistent data processing, training, and evaluation settings.
Within this process, controlled hyperparameter tuning is applied to each model on each task under a unified protocol, ensuring fair and comparable evaluation.
The overall workflow is coordinated by a multi-agent system, enabling scalable and robust benchmark maintenance over time.

Built upon AgentCity, we construct a continuously evolving traffic prediction benchmark that currently aggregates 74 representative models across multiple tasks and datasets.
All models are evaluated using unified evaluation protocols, enabling reproducible and comparable assessment across methods.
The AgentCity framework and benchmark are publicly available online, with configurations and evaluation results for reproducibility.

Our main contributions are summarized as follows:
\begin{itemize}[leftmargin=*, topsep=0pt]
\item We propose \textbf{AgentCity}, the first \textbf{AI-maintained} framework designed for continuous construction and evaluation of \textbf{traffic prediction benchmarks}.
\item We develop a multi-agent workflow that automates key benchmark maintenance processes, including literature retrieval, model and data integration, and standardized evaluation.
\item We release a large-scale, continuously updated traffic prediction benchmark and public leaderboard built upon AgentCity, supporting reproducible evaluation across tasks and datasets.
\end{itemize}

\begin{table*}[ht]
\centering
\caption{Categorization of traffic-related data and their typical representations.}
\label{tab:st_data_abstraction}
\resizebox{0.9\linewidth}{!}{
\begin{tabular}{c c c c}
\toprule
\textbf{Data Group} & \textbf{Data Category} & \textbf{Description} & \textbf{Typical Data Form} \\
\midrule
\multirow{2}{*}{Static Spatial Structure}
 & Geographical Units & Geographical entities defining the spatial domain. & $N \times D$ \\
 & Unit Relations & Structured relations between spatial units. & $N \times N$ \\
\midrule
\multirow{3}{*}{Group-level Spatiotemporal Dynamics}
 & Unit-level Dynamics & Time-varying attributes defined on spatial units. & $T \times N \times D$ \\
 & Grid-level Dynamics & Time-varying attributes defined on spatial regions. & $T \times I \times J \times D$ \\
 & Origin--Destination Dynamics & Time-varying interactions between spatial unit pairs. & $T \times N \times N \times D$ \\
\midrule
Individual Trajectory Dynamics
 & Trajectory Data & Ordered temporal sequences of spatial states. & $\{(x_i, t_i)\}_{i=1}^{L}$ \\
\bottomrule
\end{tabular}
}
\end{table*}

\begin{table*}[ht]
\centering
\caption{Categorization of traffic prediction tasks and their input--output data categories.}
\label{tab:task_summary}
\resizebox{\linewidth}{!}{
\begin{tabular}{c c c c}
\toprule
\textbf{Task} & \textbf{Input Data Category} & \textbf{Output Data Category} & \textbf{Typical Data Form} \\
\midrule
Traffic State Prediction & Group-level Dynamics \,+\, Unit Relations & Future Unit-level Dynamics &
$X \in \mathbb{R}^{T_{\text{in}} \times N \times D},\; y \in \mathbb{R}^{T_{\text{out}} \times N \times D}$ \\
\midrule
Trajectory Location Prediction & Trajectory Data \,+\, Geographical Units & Next Trajectory Location & $[loc_1, \ldots, loc_n] \rightarrow loc_{n+1}$ \\
\midrule
ETA Prediction & Trajectory Data & Travel Time & $\{(x_i, t_i)\}_{i=1}^{L} \rightarrow \Delta t$ \\
\midrule
Map Matching & Trajectory Data \,+\, Geographical Units \,+\, Unit Relations & Road Segment Sequence & $\{(lon_i, lat_i, t_i)\}_{i=1}^{L} \rightarrow \{r_j\}_{j=1}^{K}$ \\
\bottomrule
\end{tabular}
}
\end{table*}

\section{Traffic Prediction Data and Tasks}
\label{sec:background}

This section introduces a unified abstraction of data types and prediction tasks commonly studied in traffic prediction, highlighting the diversity of data organizations and task interfaces that characterize existing traffic prediction benchmarks.

\subsection{Traffic-Related Data Categories}
Traffic prediction data differ from homogeneous modalities such as images or text by combining spatial entities, relational structures, and time-indexed observations.
In traffic scenarios, these data can be broadly categorized into three groups: static spatial structure, group-level traffic dynamics, and individual trajectory dynamics.

\paratitle{Static Spatial Structure.}
Static spatial structure describes the fixed spatial context of a traffic system.
It includes geographical units that define the spatial domain, such as sensors, road segments, or regions, as well as structured relations between these units, such as network connectivity or adjacency.
This category provides the spatial foundation upon which traffic observations are organized.

\paratitle{Group-level Traffic Dynamics.}
Group-level traffic dynamics capture time-varying attributes defined over spatial units or their relations, including traffic speed, flow, or density measured at sensors or regions.
Such data are usually represented as time-indexed tensors defined on nodes, grids, or origin--destination pairs.

\paratitle{Individual Trajectory Dynamics.}
Individual trajectory dynamics describe the fine-grained mobility behavior of individual trips, represented as spatiotemporal state sequences.

Table~\ref{tab:st_data_abstraction} summarizes these data categories and their typical representations.
Throughout this paper, $N$ denotes the number of spatial units, $T$ the number of time steps, $D$ the feature dimension, $I$ and $J$ the numbers of grid rows and columns, and $L$ the trajectory length.
\subsection{Traffic Prediction Tasks}
Based on the data categories above, we consider four representative traffic prediction tasks with different input--output structures, as summarized in Table~\ref{tab:task_summary}.

\paratitle{Traffic state prediction}
forecasts future traffic dynamics over a fixed set of spatial units.
The input consists of historical group-level dynamics,
$X \in \mathbb{R}^{T_{\text{in}} \times N \times D}$,
and the output is a sequence of future unit-level dynamics,
$y \in \mathbb{R}^{T_{\text{out}} \times N \times D}$.

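To make this input--output convention concrete, the following minimal sketch (our illustration, not code from any benchmarked model; the window lengths are assumed values) cuts sliding-window samples of this form from a $T \times N \times D$ series:

\begin{verbatim}
import numpy as np

def make_windows(series, t_in=12, t_out=12):
    """Cut a (T, N, D) series into (X, y) pairs.

    X: (B, t_in, N, D), y: (B, t_out, N, D),
    with B = T - t_in - t_out + 1 samples.
    """
    T = series.shape[0]
    xs, ys = [], []
    for s in range(T - t_in - t_out + 1):
        xs.append(series[s:s + t_in])                 # history
        ys.append(series[s + t_in:s + t_in + t_out])  # horizon
    return np.stack(xs), np.stack(ys)

# One day of 5-minute readings from 207 sensors, one feature (speed).
series = np.random.rand(288, 207, 1)
X, y = make_windows(series)
print(X.shape, y.shape)  # (265, 12, 207, 1) (265, 12, 207, 1)
\end{verbatim}
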
\paratitle{Trajectory location prediction}
focuses on next-step prediction for individual trajectories.
Given a historical trajectory represented as an ordered sequence of locations $[loc_1, \ldots, loc_n]$, the task predicts the next location $loc_{n+1}$.
The input trajectories are variable in length, and the outputs are discrete spatial states.

\paratitle{Estimated time of arrival (ETA) prediction} aims to estimate the travel duration of a trajectory.
The input is an individual trajectory represented as a sequence of spatiotemporal points
$\{(x_i, t_i)\}_{i=1}^{L}$,
and the output is a scalar value representing the estimated travel time.

\paratitle{Map matching}
aims to infer the most likely network-constrained path corresponding to an observed trajectory.
Given noisy or sparse trajectory observations, the task outputs an ordered sequence of road segments that is consistent with the underlying network topology.

\section{Methodology}
\subsection{Overview}
\label{sec:overview}

AgentCity is a multi-agent framework designed to support the continuous construction and evaluation of traffic prediction benchmarks.
Built on top of LibCity~\cite{Libcity}, AgentCity enables the automated discovery, reproduction, and evaluation of traffic prediction models under unified task definitions and evaluation protocols.
Given user-specified keywords and constraints, the system incrementally identifies relevant studies, integrates their models and associated datasets, and evaluates them in a consistent manner.

As illustrated in Figure~\ref{fig:overview}, AgentCity organizes the overall process into three sequential stages: \emph{Literature Retrieval}, \emph{Model and Data Integration}, and \emph{Standardized Evaluation}.
Each stage is managed by a dedicated \emph{Stage Leader Agent}, which is responsible for planning the stage workflow, coordinating specialized \emph{Subagents}, and validating intermediate results.
Literature Retrieval focuses on identifying relevant models within a controlled search scope.
Model and Data Integration handles the reproduction and adaptation of external model implementations and datasets into unified task interfaces.
Standardized Evaluation assesses all integrated models under consistent data processing, training, and evaluation settings.

To accommodate heterogeneous implementations and incomplete specifications commonly found in research code, AgentCity supports iterative refinement within each stage.
When intermediate results do not satisfy predefined validation criteria, the corresponding Stage Leader Agent selectively re-invokes relevant Subagents to refine the outcome, with explicit limits on the number of iterations.

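This control pattern can be summarized by the following sketch (ours; method names such as \texttt{run\_subagents} and \texttt{validate} are hypothetical, not AgentCity's actual API):

\begin{verbatim}
def run_stage(leader, context, max_iters=3):
    """Stage-leader loop: plan, execute subagents, validate, retry."""
    plan = leader.plan(context)
    for attempt in range(max_iters):
        artifacts = leader.run_subagents(plan, context)
        report = leader.validate(artifacts)
        if report.ok:
            # Record artifacts so later stages can reuse them.
            context.record(leader.stage_name, artifacts)
            return artifacts
        # Re-invoke only the subagents whose outputs failed validation.
        plan = leader.replan(plan, report.failures)
    raise RuntimeError(
        f"{leader.stage_name}: failed after {max_iters} iterations")
\end{verbatim}
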
Artifacts produced at each stage, including structured metadata, configuration files, and validation summaries, are recorded and propagated across stages by a Global Coordinator.
This allows subsequent stages to operate based on established information while maintaining a clear separation of responsibilities.
Together, these components form a structured workflow that enables scalable and reproducible benchmark construction for traffic prediction.

\begin{figure*}
\centering
\includegraphics[width=0.8\linewidth]{agentv2.pdf}
\caption{AgentCity framework overview.
Benchmark construction is organized into three stages: Literature Retrieval, Model and Data Integration, and Standardized Evaluation.
Each stage is coordinated by a Leader Agent that invokes specialized Subagents to perform stage-specific operations.}
\label{fig:overview}
\end{figure*}

\subsection{Stage I: Literature Retrieval}
\label{sec:literature}
The Literature Retrieval stage collects research work related to a given traffic prediction task and produces a structured set of candidate models for downstream integration and evaluation.
This stage defines a documented search and filtering procedure and records the resulting candidates and associated metadata.
It is managed by a \emph{Retrieval Leader Agent}, which coordinates multiple Subagents to perform concrete operations.

\paratitle{Paper Searcher.}
The Paper Searcher retrieves candidate papers using keyword-based queries derived from user input or a predefined set of task-specific keywords.
Additional constraints, such as publication venues or time ranges, can be specified to delimit the search scope.
This step collects studies related to the target traffic prediction task across different modeling approaches.

\paratitle{Paper Evaluator.}
The Paper Evaluator examines each retrieved paper to determine whether it provides the information required for subsequent model and data integration.
The evaluation checks whether the paper specifies the prediction task, model formulation, input--output definitions, experimental setup, and evaluation metrics.
Papers that lack information required for model implementation, data preparation, or evaluation are excluded at this stage.

\paratitle{Paper Analyzer.}
For papers retained after evaluation, the Paper Analyzer extracts information needed for later stages.
This includes references to model architectures, code repositories, descriptions of datasets and preprocessing steps, training and evaluation settings, and reported metrics.
The extracted information is organized into a structured representation for use in model and data integration.

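For illustration, such a structured representation could resemble the following record (a sketch; the field set is our assumption rather than AgentCity's exact schema):

\begin{verbatim}
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CandidatePaper:
    """Illustrative metadata record passed from Stage I to Stage II."""
    title: str
    venue: str
    year: int
    task: str                       # e.g., "traffic_state_prediction"
    code_url: Optional[str] = None  # repository reference, if released
    datasets: list = field(default_factory=list)
    reported_metrics: dict = field(default_factory=dict)
    training_notes: str = ""        # optimizer, schedule, other settings

record = CandidatePaper(
    title="Example Spatiotemporal Model", venue="KDD", year=2024,
    task="traffic_state_prediction", code_url="https://example.org/repo",
    datasets=["METR-LA"], reported_metrics={"MAE": 3.00, "RMSE": 5.98})
\end{verbatim}
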
\paratitle{Stage execution.}
The Retrieval Leader Agent executes the search, evaluation, and analysis steps in sequence.
When the resulting paper set does not satisfy predefined criteria, such as coverage of the target task or completeness of extracted metadata, the leader reviews the execution outcomes and re-executes the relevant steps.
The output of this stage is a structured collection of candidate models and associated metadata, which is passed to the subsequent integration stage.

\subsection{Stage II: Model and Data Integration}
\label{sec:migration}

The Model and Data Integration stage reproduces external traffic prediction models together with their associated datasets and aligns them with unified task interfaces for evaluation.
This stage transforms heterogeneous research implementations into executable benchmark components that follow consistent data organization, training procedures, and evaluation protocols.
It is coordinated by an \emph{Integration Leader Agent}, which manages a set of Subagents responsible for concrete integration steps.

\paratitle{Source Collector.}
The Source Collector retrieves the resources required for reproduction, including model implementations, configuration files, and dataset references extracted in Stage~I.
It analyzes the structure of the retrieved codebase to identify model definitions, training pipelines, data loading logic, and external dependencies.
The collected sources serve as the basis for subsequent integration.

\paratitle{Model and Data Adapter.}
The Model and Data Adapter performs the core integration work.
For models, it aligns architecture definitions, input--output formats, and training interfaces with the benchmark’s task specifications.
For datasets, it handles dataset acquisition, preprocessing alignment, feature construction, and data split configuration according to the benchmark protocol.

\paratitle{Configuration Assembler.}
The Configuration Assembler constructs unified configuration files that combine model settings, dataset parameters, and training options.
Reported hyperparameters and experimental settings from the original paper are incorporated when available.
When details are unspecified, task-consistent defaults defined by the benchmark are applied.
The resulting configurations define a complete and executable evaluation setup.

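A minimal sketch of this merge logic (ours; the keys and default values are hypothetical) is:

\begin{verbatim}
TASK_DEFAULTS = {
    # Illustrative defaults, not AgentCity's actual protocol values.
    "traffic_state_prediction": {
        "input_window": 12, "output_window": 12,
        "batch_size": 64, "max_epochs": 100, "learning_rate": 1e-3,
    },
}

def assemble_config(task, dataset, model, reported):
    """Overlay paper-reported settings on task-consistent defaults."""
    config = dict(TASK_DEFAULTS[task])  # benchmark defaults first
    config.update(reported)             # reported values take precedence
    config.update({"task": task, "dataset": dataset, "model": model})
    return config

cfg = assemble_config("traffic_state_prediction", "METR-LA", "STAEformer",
                      reported={"learning_rate": 5e-4})
\end{verbatim}
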
\paratitle{Integration Validator.}
The Integration Validator executes a validation run using the assembled model and dataset configuration.
It verifies model initialization, data loading, and basic training execution, and records logs to assess integration completeness.

\paratitle{Stage execution.}
The Integration Leader Agent executes source collection, adaptation, configuration assembly, and validation in sequence, and re-invokes relevant Subagents when validation criteria are not satisfied.
The output of this stage is an executable model--dataset pair together with structured configurations and validation records, which are passed to the evaluation stage.

\begin{figure*}
\centering
\includegraphics[width=1\linewidth]{pie_combined.png}
\caption{Distributions of studies included in the benchmark.
The figure shows the distribution of collected papers by publication venue (left), publication year (middle), and traffic prediction task (right).}
\label{fig:analysis}
\end{figure*}

\subsection{Stage III: Standardized Evaluation}
\label{sec:evaluation}

The Standardized Evaluation stage evaluates integrated traffic prediction models under unified training and evaluation protocols to produce comparable performance results across models.
It is coordinated by an \emph{Evaluation Leader Agent}, which oversees a small set of Subagents responsible for execution and result aggregation.

\paratitle{Evaluation Planner.}
The Evaluation Planner specifies the evaluation configuration for each model--task pair, including training settings, evaluation metrics, and the hyperparameter ranges defined by the benchmark protocol.

\paratitle{Evaluation Executor.}
The Evaluation Executor runs model training and evaluation using the specified configurations.
During execution, it records performance metrics, training dynamics, and runtime information required for result reporting and analysis.

\paratitle{Result Collector.}
The Result Collector aggregates evaluation outputs across runs, identifies the best-performing configurations according to task-specific metrics, and organizes the results into standardized records for benchmarking.

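Taken together, the plan--execute--collect cycle could look like the following sketch (ours; the grid values and the \texttt{run\_fn} contract are assumptions):

\begin{verbatim}
import itertools

def evaluate_model(model_name, base_cfg, grid, run_fn):
    """Sweep a protocol-defined hyperparameter grid and keep the
    configuration with the best validation metric.

    run_fn(cfg) is assumed to train the model and return metrics
    such as {"val_MAE": ..., "test_MAE": ..., "test_RMSE": ...}.
    """
    keys, values = zip(*grid.items())
    best_cfg, best_metrics = None, None
    for combo in itertools.product(*values):
        cfg = {**base_cfg, **dict(zip(keys, combo))}
        metrics = run_fn(cfg)
        if best_metrics is None or \
                metrics["val_MAE"] < best_metrics["val_MAE"]:
            best_cfg, best_metrics = cfg, metrics
    return {"model": model_name, "config": best_cfg,
            "metrics": best_metrics}

grid = {"learning_rate": [1e-3, 5e-4], "batch_size": [32, 64]}  # example
\end{verbatim}
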
\paratitle{Stage execution.}
The Evaluation Leader Agent coordinates planning, execution, and result collection, and re-invokes relevant steps when evaluation results are invalid or incomplete.
The output of this stage is a set of standardized evaluation results that can be directly compared across models.

\subsection{Implementation Details}
\label{sec:implementation}

AgentCity is implemented as a coordinated multi-agent system centered around a \emph{Global Coordinator}.
The coordinator maintains a shared execution context and dispatches stage-specific \emph{Leader Agents} to execute the three benchmark stages in sequence.
Each Leader Agent manages its workflow by invoking Subagents, validating intermediate outputs, and controlling stage execution.

\paratitle{Agent Coordination and Control.}
Leader Agents follow a unified control pattern, decomposing each stage into executable steps, invoking Subagents for concrete operations, and collecting structured outputs.
Subagents encapsulate task-specific functions such as literature querying, source acquisition, code adaptation, dataset preparation, model execution, and result aggregation.

\paratitle{Cross-Stage Context Propagation.}
The Global Coordinator maintains a shared execution context that records structured artifacts produced at each stage.
These artifacts are propagated across stages to support subsequent execution without repeating earlier steps.

\paratitle{Model Backend Configuration.}
Different language model backends can be assigned to agents according to task requirements.
Code-related and diagnostic tasks use more capable backends, while routine operations may use lighter-weight ones.
Backend selection is specified through system configuration and is independent of the overall workflow structure.

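Such a configuration might take the following shape (a sketch; the backend identifiers and agent names are placeholders):

\begin{verbatim}
# Hypothetical backend assignment; the actual config format may differ.
BACKENDS = {
    "default": "lightweight-llm",                 # routine operations
    "model_and_data_adapter": "strong-code-llm",  # code adaptation
    "integration_validator": "strong-code-llm",   # diagnostic analysis
    "paper_searcher": "lightweight-llm",
}

def backend_for(agent_name):
    """Resolve the language model backend for a given agent."""
    return BACKENDS.get(agent_name, BACKENDS["default"])
\end{verbatim}
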
\begin{table}[t]
\centering
\caption{Traffic prediction datasets in AgentCity.}
\label{tab:dataset_stats}
\resizebox{\linewidth}{!}{
\begin{tabular}{l p{7cm}}
\toprule
\textbf{Task} & \textbf{Dataset} \\
\midrule
Traffic State Prediction & METR-LA\cite{METR_LA/PEMS_BAY}, PEMSD7(M)\cite{PEMSD7M}, PEMS-BAY\cite{METR_LA/PEMS_BAY}, PEMSD3\cite{PEMSD3/7}, PEMSD4\cite{PEMSD4/8}, PEMSD7\cite{PEMSD3/7}, PEMSD8\cite{PEMSD4/8}, TAXIBJ\cite{TaxiBJ}, T-DRIVE\cite{T-drive}, NYCTaxi\cite{NYCTaxi/Bike}, NYCBike\cite{NYCTaxi/Bike}, LargeST\cite{LargeST} \\
Traj. Loc. Prediction & Gowalla\cite{Gowalla/BrightKite}, Foursquare-TKY\cite{Foursquare-NYC/TKY}, Foursquare-NYC\cite{Foursquare-NYC/TKY}, BrightKite\cite{Gowalla/BrightKite}, Instagram\cite{Instagram}, Singapore\cite{Singapore}, Porto\cite{Porto} \\
ETA Prediction & Chengdu\cite{Chengdu/DeepTTE}, Beijing\cite{Beijing/TTPNet}, Porto\cite{Porto}, NYCTaxi\cite{NYCTaxi/Bike}, NYCBike\cite{NYCTaxi/Bike} \\
Map Matching & Global\cite{Global} (Neftekamsk, Santander, Spaichingen, Valky), Seattle\cite{Seattle} \\
\bottomrule
\end{tabular}}
\end{table}

\begin{table}[t]
\centering
\caption{Traffic prediction models in AgentCity.}
\label{tab:model_stats}
\resizebox{\linewidth}{!}{
\begin{tabular}{l p{9cm}}
\toprule
\textbf{Task} & \textbf{Model} \\
\midrule
Traffic State Prediction &
STSSDL\cite{STSSDL}, STAEformer\cite{STAEformer}, AutoSTF\cite{AutoSTF}, STDMAE\cite{STDMAE},
EAC\cite{EAC}, GriddedTNP\cite{GriddedTNP}, PatchSTG\cite{PatchSTG}, SRSNet\cite{SRSNet},
FlashST\cite{FlashST}, ConvTimeNet\cite{Convtimenet}, Fredformer\cite{Fredformer}, Pathformer\cite{Pathformer},
HTVGNN\cite{HTVGNN}, PatchTST\cite{PatchTST}, DCST\cite{DCST}, STLLM\cite{STLLM},
T-graphormer\cite{T-graphormer}, CKGGNN\cite{CKGGNN}, EasyST\cite{EasyST}, LEAF\cite{LEAF},
MetaDG\cite{MetaDG}, TRACK\cite{TRACK}, HiMSNet\cite{HiMSNet}, DST2former\cite{DST2former},
DSTMamba\cite{DSTMamba}, BigST\cite{BigST}, ASeer\cite{ASeer}, STHSepNet\cite{STHSepNet},
STWave\cite{STWave}, HSTWAVE\cite{HSTWAVE}, DSTAGNN\cite{DSTAGNN}, RSTIB\cite{RSTIB},
LSTTN\cite{LSTTN}, LightST\cite{LightST}, TimeMixer++\cite{TimeMixer++}, STID\cite{STID}, UniST\cite{UniST} \\
Traj. Loc. Prediction &
DeepMove\cite{DeepMove}, PLMTrajRec\cite{PLMTrajRec}, START\cite{START}, LoTNext\cite{LoTNext},
RNTrajRec\cite{RNTrajRec}, CoMaPOI\cite{CoMaPOI}, JGRM\cite{JGRM}, TrajSDE\cite{TrajSDE},
DCHL\cite{DCHL}, GNPRSID\cite{GNPRSID}, PLSPL\cite{PLSPL}, GETNext\cite{GETNEXT},
CANOE\cite{CANOE}, TPG\cite{TPG}, CLSPRec\cite{CLSPRec}, AGRAN\cite{AGRAN},
LightPath\cite{LightPath}, ROTAN\cite{ROTAN}, FPMC\cite{FPMC}, PRME\cite{PRME} \\
ETA Prediction &
DOT\cite{DOT}, MetaTTE\cite{MetaTTE}, MVSTM\cite{MVSTM}, DutyTTE\cite{DutyTTE},
TTPNet\cite{TTPNet}, MTSTAN\cite{MTSTAN}, MulT-TTE\cite{MulT-TTE}, MDTI\cite{MDTI},
ProbETA\cite{ProbETA}, HierETA\cite{HierETA}, HetETA\cite{HetETA} \\
Map Matching &
DeepMM\cite{DeepMM}, GraphMM\cite{GraphMM}, DiffMM\cite{DiffMM}, TRMMA\cite{TRMMA},
L2MM\cite{L2MM}, RLOMM\cite{RLOMM}, FMM\cite{FMM}, HMMM\cite{HMMM}, STMatching\cite{STMatching} \\
\bottomrule
\end{tabular}}
\end{table}

\begin{table*}[t]
\centering
\caption{Task-wise datasets, data scale statistics, and evaluation metrics used in the benchmark.
$N$, $E$, and $U$ denote the numbers of nodes, edges, and users, respectively.
$T$ denotes the total volume of data records, corresponding to the accumulated traffic flow observations for Traffic State Prediction and the total number of trajectory points or check-ins for the other tasks.}
\label{tab:task_dataset_overview}
\resizebox{\linewidth}{!}{
\begin{tabular}{l l l l l}
\toprule
\textbf{Task} & \textbf{Dataset} & \textbf{Scale ($N/E/U/T$)} & \textbf{Time Span} & \textbf{Metrics} \\
\midrule
\multirow{3}{*}{Traffic State Prediction}
 & METR-LA & $N{=}207$, $E{=}11{,}753$, $T{=}7.1$M & Mar. 2012 -- Jun. 2012 & MAE$\downarrow$, RMSE$\downarrow$ \\
 & PEMSD7 & $N{=}228$, $E{=}51{,}984$, $T{=}2.9$M & May 2017 -- Aug. 2017 & MAE$\downarrow$, RMSE$\downarrow$ \\
 & PEMS-BAY & $N{=}325$, $E{=}8{,}358$, $T{=}16.9$M & Jan. 2017 -- Jun. 2017 & MAE$\downarrow$, RMSE$\downarrow$ \\
\midrule
\multirow{3}{*}{Trajectory Location Prediction}
 & Foursquare\_NYC & $N{=}38{,}332$, $U{=}1{,}082$, $T{=}227$K & Apr. 2012 -- Feb. 2013 & Acc@1$\uparrow$, Acc@5$\uparrow$ \\
 & Foursquare\_TKY & $N{=}61{,}857$, $U{=}2{,}292$, $T{=}574$K & Apr. 2012 -- Feb. 2013 & Acc@1$\uparrow$, Acc@5$\uparrow$ \\
 & Singapore & $N{=}20{,}153$, $U{=}17{,}744$, $T{=}696$K & Jan. 2017 -- Jun. 2017 & Acc@1$\uparrow$, Acc@5$\uparrow$ \\
\midrule
\multirow{2}{*}{ETA Prediction}
 & Beijing & $N{=}16{,}383$, $U{=}76$, $T{=}518$K & Oct. 2013 & MAE$\downarrow$, MAPE$\downarrow$, RMSE$\downarrow$ \\
 & Chengdu & $N{=}440{,}056$, $U{=}4{,}565$, $T{=}712$K & Aug. 2014 & MAE$\downarrow$, MAPE$\downarrow$, RMSE$\downarrow$ \\
\midrule
\multirow{4}{*}{Map Matching}
 & Neftekamsk & $N{=}18{,}195$, $E{=}41{,}971$, $T{=}2.5$K & 2015 & RMF$\downarrow$, AL$\uparrow$ \\
 & Santander & $N{=}24{,}217$, $E{=}48{,}100$, $T{=}653$ & 2015 & RMF$\downarrow$, AL$\uparrow$ \\
 & Spaichingen & $N{=}4{,}575$, $E{=}9{,}992$, $T{=}517$ & 2015 & RMF$\downarrow$, AL$\uparrow$ \\
 & Valky & $N{=}1{,}578$, $E{=}3{,}142$, $T{=}1.0$K & 2015 & RMF$\downarrow$, AL$\uparrow$ \\
\bottomrule
\end{tabular}}
\end{table*}

\begin{table}[t]
\centering
\caption{Traffic state prediction leaderboard on METR\_LA, PEMSD7, and PEMS\_BAY under unified evaluation protocols.}
\label{tab:traffic_leaderboard}
\resizebox{1\linewidth}{!}{
\begin{tabular}{l cc cc cc}
\toprule
\textbf{Model} &
\multicolumn{2}{c}{\textbf{METR\_LA}} &
\multicolumn{2}{c}{\textbf{PEMSD7}} &
\multicolumn{2}{c}{\textbf{PEMS\_BAY}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& MAE$\downarrow$ & RMSE$\downarrow$
& MAE$\downarrow$ & RMSE$\downarrow$
& MAE$\downarrow$ & RMSE$\downarrow$ \\
\midrule
STAEformer\cite{STAEformer} & 2.962 & 5.984 & 18.96 & 32.28 & 1.532 & 3.446 \\
DCST\cite{DCST} & 3.090 & 6.334 & 19.39 & 32.72 & 1.561 & 3.483 \\
DST2former\cite{DST2former} & 3.095 & 6.240 & 19.67 & 32.61 & 1.639 & 3.587 \\
STDMAE\cite{STDMAE} & 3.096 & 6.230 & 20.19 & 32.99 & 1.579 & 3.502 \\
EasyST\cite{EasyST} & 3.115 & 6.419 & 19.49 & 32.48 & 1.565 & 3.509 \\
PatchSTG\cite{PatchSTG} & 3.127 & 6.316 & 19.99 & 32.90 & 1.589 & 3.580 \\
HiMSNet\cite{HiMSNet} & 3.143 & 6.221 & 23.34 & 36.04 & 1.670 & 3.613 \\
STLLM\cite{STLLM} & 3.151 & 6.284 & 20.92 & 33.65 & 1.616 & 3.592 \\
LightST\cite{LightST} & 3.167 & 6.372 & 22.00 & 34.59 & 1.607 & 3.580 \\
STWave\cite{STWave} & 3.186 & 6.417 & 23.02 & 37.04 & 1.619 & 3.621 \\
RSTIB\cite{RSTIB} & 3.194 & 6.606 & 20.37 & 33.40 & 1.610 & 3.666 \\
FlashST\cite{FlashST} & 3.203 & 6.511 & 22.40 & 35.47 & 1.636 & 3.645 \\
BigST\cite{BigST} & 3.218 & 6.359 & 21.11 & 34.18 & 1.622 & 3.538 \\
TRACK\cite{TRACK} & 3.278 & 6.710 & 25.82 & 39.31 & 1.749 & 4.007 \\
DSTAGNN\cite{DSTAGNN} & 3.331 & 6.599 & 22.73 & 36.04 & 1.745 & 3.800 \\
GriddedTNP\cite{GriddedTNP} & 3.412 & 6.989 & 29.83 & 53.10 & 2.379 & 5.099 \\
EAC\cite{EAC} & 3.532 & 6.915 & 26.61 & 40.23 & 1.834 & 4.045 \\
AutoSTF\cite{AutoSTF} & 3.977 & 9.406 & 19.72 & 32.56 & 1.544 & 3.446 \\
Fredformer\cite{Fredformer} & 4.159 & 9.014 & 24.16 & 38.54 & 1.866 & 4.214 \\
ConvTimeNet\cite{Convtimenet} & 4.250 & 9.249 & 29.18 & 45.33 & 2.014 & 4.650 \\
LEAF\cite{LEAF} & 4.407 & 9.989 & 28.49 & 43.17 & 1.886 & 4.101 \\
SRSNet\cite{SRSNet} & 4.882 & 10.348 & 32.12 & 48.80 & 2.163 & 4.923 \\
\bottomrule
\end{tabular}}
\end{table}

\begin{table}[t]
\centering
\caption{Trajectory location prediction leaderboard on Foursquare\_NYC, Foursquare\_TKY, and Singapore.}
\label{tab:traj_leaderboard}
\resizebox{1\linewidth}{!}{
\begin{tabular}{l cc cc cc}
\toprule
\textbf{Model} &
\multicolumn{2}{c}{\textbf{Foursquare\_NYC}} &
\multicolumn{2}{c}{\textbf{Foursquare\_TKY}} &
\multicolumn{2}{c}{\textbf{Singapore}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& Acc@1$\uparrow$ & Acc@5$\uparrow$
& Acc@1$\uparrow$ & Acc@5$\uparrow$
& Acc@1$\uparrow$ & Acc@5$\uparrow$ \\
\midrule
ROTAN\cite{ROTAN} & 0.1302 & 0.2805 & 0.1897 & 0.3653 & 0.1631 & 0.3331 \\
GNPRSID\cite{GNPRSID} & 0.1591 & 0.3419 & 0.1658 & 0.3746 & 0.1539 & 0.3471 \\
RNTrajRec\cite{RNTrajRec} & 0.1605 & 0.3231 & 0.1539 & 0.3305 & 0.1378 & 0.2978 \\
DeepMove\cite{DeepMove} & 0.1572 & 0.3739 & 0.1800 & 0.3869 & 0.1298 & 0.3096 \\
PLSPL\cite{PLSPL} & 0.1034 & 0.3211 & 0.1732 & 0.3596 & 0.1527 & 0.3294 \\
CANOE\cite{CANOE} & 0.1147 & 0.2883 & 0.1535 & 0.3485 & 0.1366 & 0.3089 \\
LoTNext\cite{LoTNext} & 0.0856 & 0.2402 & 0.1322 & 0.3890 & 0.1365 & 0.3576 \\
DCHL\cite{DCHL} & 0.1009 & 0.3141 & 0.0706 & 0.2507 & 0.0889 & 0.2678 \\
\bottomrule
\end{tabular}}
\end{table}

\begin{table}[t]
\centering
\caption{ETA prediction leaderboard on Beijing and Chengdu.}
\label{tab:eta_leaderboard}
\resizebox{\linewidth}{!}{
\begin{tabular}{l ccc ccc}
\toprule
\multirow{2}{*}{\textbf{Model}} &
\multicolumn{3}{c}{\textbf{Beijing}} &
\multicolumn{3}{c}{\textbf{Chengdu}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
& MAE$\downarrow$ & MAPE$\downarrow$ & RMSE$\downarrow$
& MAE$\downarrow$ & MAPE$\downarrow$ & RMSE$\downarrow$ \\
\midrule
HetETA\cite{HetETA} & 125.67 & 0.105 & 222.91 & 190.56 & 0.113 & 308.56 \\
DeepTTE\cite{Chengdu/DeepTTE} & 224.46 & 0.208 & 351.74 & 317.38 & 0.220 & 429.09 \\
MVSTM\cite{MVSTM} & 279.08 & 0.270 & 430.98 & 255.18 & 0.189 & 343.43 \\
MulT-TTE\cite{MulT-TTE} & 280.36 & 0.274 & 432.43 & 465.59 & 0.381 & 580.25 \\
DOT\cite{DOT} & 364.85 & 0.382 & 547.62 & 209.74 & 0.163 & 286.02 \\
MetaTTE\cite{MetaTTE} & 372.15 & 0.347 & 562.24 & 394.52 & 0.300 & 511.63 \\
DutyTTE\cite{DutyTTE} & 431.59 & 0.460 & 572.96 & 243.13 & 0.171 & 443.44 \\
\bottomrule
\end{tabular}}
\end{table}

\begin{table}[t]
\centering
\caption{Map matching leaderboard on Santander, Spaichingen, Neftekamsk, and Valky.}
\label{tab:mm_leaderboard}
\resizebox{0.9\linewidth}{!}{
\begin{tabular}{l cc cc cc cc}
\toprule
\multirow{2}{*}{\textbf{Model}} &
\multicolumn{2}{c}{\textbf{Santander}} &
\multicolumn{2}{c}{\textbf{Spaichingen}} &
\multicolumn{2}{c}{\textbf{Neftekamsk}} &
\multicolumn{2}{c}{\textbf{Valky}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
& RMF$\downarrow$ & AL$\uparrow$
& RMF$\downarrow$ & AL$\uparrow$
& RMF$\downarrow$ & AL$\uparrow$
& RMF$\downarrow$ & AL$\uparrow$ \\
\midrule
FMM\cite{FMM} & 0.018 & 1.000 & 0.000 & 1.000 & 0.852 & 0.193 & 0.329 & 0.671 \\
HMMM\cite{HMMM} & 0.021 & 0.997 & 0.035 & 1.000 & 0.391 & 0.999 & 0.433 & 1.000 \\
STMatching\cite{STMatching} & 0.674 & 0.998 & 0.088 & 1.000 & 0.457 & 1.000 & 0.436 & 1.000 \\
DeepMM\cite{DeepMM} & 0.981 & 0.019 & 0.947 & 0.053 & 0.889 & 0.111 & 0.909 & 0.091 \\
L2MM\cite{L2MM} & 1.132 & 0.057 & 1.632 & 0.158 & 0.778 & 0.222 & 2.455 & 0.182 \\
RLOMM\cite{RLOMM} & 0.920 & 0.280 & 2.760 & 0.240 & 7.440 & 0.120 & 3.000 & 0.600 \\
\bottomrule
\end{tabular}}
\end{table}
\section{The AgentCity Benchmark}
\label{sec:benchmark_release}

\subsection{Benchmark Scope and Coverage}
\label{sec:benchmark_scope}

AgentCity supports a unified benchmark that spans multiple traffic prediction tasks and datasets.
At the time of writing, the benchmark covers four representative traffic prediction tasks: traffic state prediction, trajectory location prediction, ETA prediction, and map matching.
Across these tasks, AgentCity aggregates a diverse collection of publicly available datasets and model implementations.

Table~\ref{tab:dataset_stats} summarizes the datasets included in the benchmark.
In total, AgentCity covers 26 publicly available datasets across the four traffic prediction tasks.
These datasets span heterogeneous spatial representations and temporal resolutions, including graph-based, grid-based, and origin--destination data for traffic state prediction, as well as trajectory datasets represented as variable-length sequences of locations or GPS points.
For ETA prediction and map matching, the benchmark includes GPS trajectory datasets with different scales in terms of trajectory volume and network size.

Table~\ref{tab:model_stats} summarizes the traffic prediction models currently included in AgentCity.
For each task, the benchmark integrates a representative set of models that follow heterogeneous modeling assumptions and architectural designs.
All models are reproduced and evaluated under unified task definitions and evaluation protocols, enabling consistent comparison within and across tasks.

Across tasks, the benchmark includes datasets defined on sensor networks, region-based spatial partitions, road network graphs, and individual trajectories.
Traffic state prediction datasets are typically defined on fixed sensor networks with regular temporal sampling, while trajectory-based datasets represent individual mobility as sequences of locations or GPS points.
Map matching datasets are constructed on explicit road networks and focus on network-constrained trajectory inference.
Together, these datasets capture both group-level and individual-level traffic dynamics under heterogeneous spatial settings.

\subsection{Literature Coverage Analysis}
\label{sec:literature_analysis}

To characterize the literature coverage of the benchmark, we analyze the distribution of studies included through AgentCity across publication venues, years, and traffic prediction tasks.
Figure~\ref{fig:analysis} summarizes these statistics based on the models that have been reproduced and integrated into the benchmark.

In total, the benchmark includes 74 research papers published in recent years.
These papers span multiple traffic prediction tasks, with 36 studies on traffic state prediction, 18 on trajectory location prediction, 11 on estimated time of arrival (ETA) prediction, and 9 on map matching.
This task distribution reflects the relative research activity across different traffic prediction problems.

The venue distribution indicates that many collected studies originate from major data mining and machine learning venues, with KDD representing the largest share.
In addition, a notable portion of models are released through arXiv, reflecting research activity beyond traditional conference venues.

The year distribution indicates that most included studies were published between 2023 and 2025.
This concentration reflects the recent growth of research activity in traffic prediction and related areas.
These statistics provide a descriptive overview of the literature represented in the benchmark and clarify the scope of models evaluated in AgentCity.

\subsection{Task-wise Leaderboards}
\label{sec:leaderboards}

This subsection presents representative leaderboard results for four core traffic prediction tasks under unified evaluation protocols.
The reported results provide a task-wise view of model performance under consistent data processing, training, and evaluation settings.

Traffic state prediction results are reported on METR\_LA, PEMSD7, and PEMS\_BAY; trajectory location prediction on Foursquare (NYC, TKY) and Singapore; ETA prediction on Beijing and Chengdu; and map matching on selected cities from the Global dataset.
All models are evaluated within a unified framework, with hyperparameters systematically tuned via AgentCity.
Training is controlled using early stopping based on validation loss, and the checkpoint with the best validation performance is selected for evaluation.

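This training control corresponds to a standard early-stopping loop of the following form (a generic PyTorch-style sketch; the patience value and the train/validate helpers are illustrative, not AgentCity's exact code):

\begin{verbatim}
def train_with_early_stopping(model, train_epoch, validate,
                              max_epochs=100, patience=10):
    """Stop when validation loss stops improving; keep best checkpoint."""
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)                 # one pass over training data
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
            best_state = {k: v.clone()
                          for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:
                break                      # early stop
    model.load_state_dict(best_state)      # best validation checkpoint
    return model
\end{verbatim}
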
Table~\ref{tab:task_dataset_overview} summarizes the datasets used in the reported benchmark results, together with their basic statistics and evaluation protocols.
Tables~\ref{tab:traffic_leaderboard}--\ref{tab:mm_leaderboard} present the corresponding task-wise leaderboard results under consistent evaluation settings.

For clarity and space considerations, we report results on a representative subset of widely used datasets and models for each task, following standard evaluation settings in prior studies.
The complete benchmark results, covering additional datasets and model implementations, are available through the online leaderboard.

\begin{figure}
\centering
\begin{subfigure}[b]{0.48\linewidth}
\hspace{-3px}
\includegraphics[width=\linewidth]{figures/Frontend.png}
\caption{Benchmark Homepage}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\linewidth}
\hspace{-3px}
\includegraphics[width=\linewidth]{figures/LeaderBoard.png}
\caption{AgentCity Interface}
\end{subfigure}
\caption{The AgentCity platform.
The benchmark homepage presents benchmark statistics and public leaderboards.
The AgentCity interface provides an interactive environment for the agent-driven workflow.}
\label{fig:AgentCity}
\end{figure}
\subsection{Benchmark Access and Usage}
\label{sec:benchmark_access}

The AgentCity benchmark is publicly accessible.
Figure~\ref{fig:AgentCity} presents the project homepage and the AgentCity user interface, which together provide benchmark information, evaluation results, and guidance for executing the benchmark workflow with AgentCity.

The project homepage introduces the overall scope of AgentCity, including the supported traffic prediction tasks, benchmark organization, and evaluation protocols.
It provides documentation for installing and running AgentCity and presents detailed task-wise leaderboards that report benchmark results under unified evaluation settings.
The AgentCity user interface allows users to interactively execute the benchmark construction workflow described in this paper.
Through the interface, users can run the three stages of literature retrieval, model and data integration, and standardized evaluation, and examine the corresponding outputs.
Execution logs, intermediate artifacts, and analysis results from each stage are displayed to support inspection of the benchmark process.

Detailed usage instructions, task-wise leaderboards, and documentation of the unified evaluation framework are available through the project website and source code repository.\footnote{\fulllink}
\begin{table}[t]
\centering
\caption{Comparison between reported results and reproduced results in terms of MAE and RMSE.}
\label{tab:mae_rmse_comparison}
\resizebox{\linewidth}{!}{%
\begin{tabular}{l l cc cc c}
\toprule
\multirow{2}{*}{\textbf{Model}} & \multirow{2}{*}{\textbf{Dataset}} &
\multicolumn{2}{c}{\textbf{Paper Reported}} &
\multicolumn{2}{c}{\textbf{Reproduced}} &
\multirow{2}{*}{\textbf{Gap (\%)}} \\
\cmidrule(lr){3-4} \cmidrule(lr){5-6}
& & MAE & RMSE & MAE & RMSE & \\
\midrule
DSTAGNN & PEMSD4 & 19.30 & 31.46 & 19.90 & 31.29 & 0.85 \\
LightST & PEMSD7 & 20.78 & 33.95 & 21.99 & 34.59 & 3.38 \\
RSTIB & PEMSD7 & 19.84 & 33.90 & 20.37 & 33.40 & 0.06 \\
STDMAE & METR\_LA & 3.00 & 5.98 & 3.09 & 6.23 & 3.79 \\
LSTTN & METR\_LA & 2.96 & 5.92 & 3.08 & 6.12 & 3.60 \\
AutoSTF & PEMS\_BAY & 1.55 & 3.51 & 1.54 & 3.44 & -1.58 \\
DCST & PEMS\_BAY & 1.55 & 3.50 & 1.56 & 3.48 & -0.20 \\
\bottomrule
\end{tabular}%
}
\end{table}

\begin{table*}[t]
\centering
\caption{Comparison of reproduction consistency between AgentCity and other code-oriented agents.}
\label{tab:selected_models}
\resizebox{0.8\linewidth}{!}{
\begin{tabular}{l ccc ccc ccc ccc}
\toprule
\multirow{2}{*}{\textbf{Source}} &
\multicolumn{3}{c}{\textbf{STDMAE (PEMSD7)}} &
\multicolumn{3}{c}{\textbf{LightST (PEMSD7)}} &
\multicolumn{3}{c}{\textbf{LSTTN (METR\_LA)}} &
\multicolumn{3}{c}{\textbf{DSTAGNN (PEMSD4)}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7} \cmidrule(lr){8-10} \cmidrule(lr){11-13}
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$
& \small MAE$\downarrow$ & \small RMSE$\downarrow$ & \small Gap\%$\downarrow$ \\
\midrule
Reported~(Paper) &
18.65 & 31.44 & 0.00 &
20.78 & 33.95 & 0.00 &
2.96 & 5.92 & 0.00 &
19.30 & 31.46 & 0.00 \\
SWE-agent &
31.96 & 45.87 & 55.38 &
22.21 & 34.76 & 4.09 &
4.50 & 9.84 & 61.49 &
20.11 & 31.48 & 1.64 \\
OpenHands &
21.79 & 34.55 & 12.48 &
26.18 & 38.89 & 18.89 &
6.55 & 11.80 & 106.64 &
20.27 & 31.97 & 2.91 \\
\textbf{AgentCity} &
\textbf{20.19} & \textbf{32.99} & \textbf{6.17} &
\textbf{21.99} & \textbf{34.59} & \textbf{3.38} &
\textbf{3.08} & \textbf{6.12} & \textbf{3.60} &
\textbf{19.90} & \textbf{31.29} & \textbf{0.85} \\
\bottomrule
\end{tabular}}
\end{table*}

\section{Benchmark Validation}
\label{sec:validation}

\subsection{Reproduction Fidelity}
\label{sec:fidelity}

We evaluate the reproduction fidelity of AgentCity by comparing reproduced results with the metrics reported in the original papers.
This analysis examines whether AgentCity reproduces results that are consistent with those reported in prior studies.

We focus on the traffic state prediction task, which has well-established datasets and evaluation protocols and is commonly used in the literature.
Seven representative models are selected for analysis, covering different architectural designs and training strategies.
For each model--dataset pair, we report the MAE and RMSE values stated in the original paper together with the corresponding results reproduced by AgentCity.
The relative gap between reported and reproduced results is summarized in Table~\ref{tab:mae_rmse_comparison}.

Across the examined models and datasets, the reproduced results are generally close to the reported values.
Differences between reproduced results and reported values can arise from software and hardware environments, nondeterministic training behavior, and minor implementation variations.
All results are obtained using a consistent reproduction and evaluation process without manual intervention, indicating that AgentCity reproduces published traffic prediction models with reasonable fidelity.

\subsection{Reproduction Consistency Across Code Agents}
\label{sec:agent_comparison}

We compare the reproduction results obtained by AgentCity with those produced by two general-purpose code-oriented agents, SWE-agent~\cite{Swe-agent} and OpenHands~\cite{OpenHands}.
The comparison examines reproduction consistency, defined as how closely reproduced results match the metrics reported in the original papers.

All agents are evaluated under the same reproduction setting with Claude-4.5-Opus as the underlying language model, operate on the same code repositories and datasets, and follow the same reproduction objective of matching reported MAE and RMSE values.
The prompts used to specify reproduction tasks are identical across agents and are described in Appendix~\ref{Model Adapter}.
Each agent is allowed to iteratively execute, debug, and rerun code until a valid training and evaluation pipeline is completed.
No manual intervention or task-specific adjustment is performed for any agent during the reproduction process.
Table~\ref{tab:selected_models} summarizes the reproduction results.
For each model--dataset pair, the table reports the metrics stated in the original paper together with the reproduced MAE, RMSE, and relative gaps.
Across the evaluated cases, AgentCity produces reproduced results that are closer to the reported values than those obtained by the other agents under the same reproduction setting.
\section{Related Work}

\subsection{Traffic Prediction Benchmarks}

Benchmark research in traffic prediction has progressed from unified deep learning toolkits toward more diverse evaluation settings.
Early benchmarks such as LibCity~\cite{Libcity}, DL-Traff~\cite{Dl-traff}, and TorchSpatial~\cite{Torchspatial} focus on standardizing data processing, task definitions, and evaluation protocols for traffic prediction models, providing a common basis for reproducible comparison of predictive performance.
More recent efforts, including CityBench~\cite{CityBench}, STBench~\cite{STBench}, and USTBench~\cite{USTBench}, extend benchmarking beyond predictive accuracy to assess semantic understanding, reasoning, and planning capabilities of general-purpose models in urban and transportation scenarios.
Despite this progress, most existing traffic prediction benchmarks are constructed and maintained through largely manual processes.
The automation and continuous maintenance of the benchmarking workflow remain insufficiently addressed.

\subsection{LLM Agents for Automated Reproduction and Benchmarking}

Recent advances in large language model (LLM) agents have enabled tighter coupling between natural language reasoning and automated code generation in scientific workflows.
General-purpose frameworks such as SWE-agent~\cite{Swe-agent} and OpenHands~\cite{OpenHands} demonstrate the ability to navigate and modify complex code repositories, while more specialized systems, including ML-Master~\cite{ML-Master} and PiML~\cite{PiML}, focus on automating and optimizing machine learning pipelines.
Building on these capabilities, research-oriented agents such as DeepCode~\cite{DeepCode}, Paper2Code~\cite{Paper2code}, and Agent Laboratory~\cite{Agentlaboratory} aim to support broader stages of the scientific process, ranging from algorithm understanding to experiment execution and reproduction~\cite{Autoreproduce}.
Despite this progress, most existing LLM-based agents are designed for general-purpose code interaction and research automation.
Their workflows do not explicitly account for the domain-specific requirements of traffic and spatiotemporal reproduction, such as heterogeneous data organization, task-specific preprocessing pipelines, and structured spatial representations.

\section{Conclusion}

In this work, we present AgentCity, an AI-maintained framework for the continuous construction and evaluation of traffic prediction benchmarks.
AgentCity formulates benchmark maintenance as a structured, agent-driven workflow that automates literature retrieval, model and data integration, and standardized evaluation under unified protocols, including systematic hyperparameter tuning.
This allows benchmark construction to be treated as an ongoing and scalable process rather than a one-time manual effort.
Built on this framework, we release a publicly accessible traffic prediction benchmark that spans multiple representative tasks, integrates diverse datasets and model implementations, and provides task-wise leaderboards under consistent evaluation settings.
We further validate the reliability of the framework by comparing reproduced results with those reported in original papers and with results obtained by general-purpose code-oriented agents under the same reproduction settings, demonstrating stable and consistent reproduction performance.
AgentCity enables continuous and scalable maintenance of traffic prediction benchmarks under unified evaluation protocols, providing a reproducible basis for integrating and evaluating models as the benchmark evolves.
@@ -1,244 +0,0 @@
@article{bjerva2020subjqa,
author = {Johannes Bjerva and Nikita Bhutani and Behzad Golshan and Wang-Chiew Tan and Isabelle Augenstein},
title = {SubjQA: A Dataset for Subjectivity and Review Comprehension},
journal = {arXiv preprint arXiv:2004.14283},
eprint = {2004.14283},
archivePrefix = {arXiv},
year = {2020}
}

@inproceedings{contractor2021answering,
author = {Danish Contractor and Krunal Shah and Aditi Partap and Parag Singla and Mausam},
title = {Answering POI-Recommendation Questions Using Tourism Reviews},
booktitle = {Proceedings of the 30th ACM International Conference on Information \& Knowledge Management},
pages = {281--291},
year = {2021}
}

@inproceedings{deng2023spatio,
author = {Pan Deng and Yu Zhao and Junting Liu and Xiaofeng Jia and Mulan Wang},
title = {Spatio-Temporal Neural Structural Causal Models for Bike Flow Prediction},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {37},
pages = {4242--4249},
year = {2023}
}

@article{dong2022spatiotemporal,
author = {Qidi Dong and Jun Cai and Shuo Chen and Pengman He and Xuli Chen},
title = {Spatiotemporal Analysis of Urban Green Spatial Vitality and the Corresponding Influencing Factors: A Case Study of Chengdu, China},
journal = {Land},
volume = {11},
number = {10},
pages = {1820},
year = {2022}
}

@article{feng2024citygpt,
author = {Jie Feng and Yuwei Du and Tianhui Liu and Siqi Guo and Yuming Lin and Yong Li},
title = {CityGPT: Empowering Urban Spatial Cognition of Large Language Models},
journal = {arXiv preprint arXiv:2406.13948},
eprint = {2406.13948},
archivePrefix = {arXiv},
year = {2024}
}

@article{grattafiori2024llama,
author = {Aaron Grattafiori and Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Alex Vaughan and others},
title = {The Llama 3 Herd of Models},
journal = {arXiv preprint arXiv:2407.21783},
eprint = {2407.21783},
archivePrefix = {arXiv},
year = {2024}
}

@article{gruber2024complextempqa,
author = {Raphael Gruber and Abdelrahman Abdallah and Michael F{\"a}rber and Adam Jatowt},
title = {ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering},
journal = {arXiv preprint arXiv:2406.04866},
eprint = {2406.04866},
archivePrefix = {arXiv},
year = {2024}
}

@inproceedings{hu2022lora,
author = {Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen and others},
title = {LoRA: Low-Rank Adaptation of Large Language Models},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2022}
}

@inproceedings{jia2018tempquestions,
author = {Zhen Jia and Abdalghani Abujabal and Rishiraj Saha Roy and Jannik Str{\"o}tgen and Gerhard Weikum},
title = {TempQuestions: A Benchmark for Temporal Question Answering},
booktitle = {Companion Proceedings of The Web Conference 2018},
pages = {1057--1062},
year = {2018}
}

@article{joshi2017triviaqa,
author = {Mandar Joshi and Eunsol Choi and Daniel S. Weld and Luke Zettlemoyer},
title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},
journal = {arXiv preprint arXiv:1705.03551},
eprint = {1705.03551},
archivePrefix = {arXiv},
year = {2017}
}

@article{kwiatkowski2019natural,
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Jacob Devlin and Kenton Lee and others},
title = {Natural Questions: A Benchmark for Question Answering Research},
journal = {Transactions of the Association for Computational Linguistics},
volume = {7},
pages = {453--466},
year = {2019}
}

@article{lewis2020retrieval,
author = {Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich K{\"u}ttler and Mike Lewis and Wen-tau Yih and Tim Rockt{\"a}schel and others},
title = {Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks},
journal = {Advances in Neural Information Processing Systems},
volume = {33},
pages = {9459--9474},
year = {2020}
}

@article{li2024stbench,
author = {Wenbin Li and Di Yao and Ruibo Zhao and Wenjie Chen and Zijie Xu and Chengxue Luo and Chang Gong and Quanliang Jing and Haining Tan and Jingping Bi},
title = {STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis},
journal = {arXiv preprint arXiv:2406.19065},
eprint = {2406.19065},
archivePrefix = {arXiv},
year = {2024}
}

@inproceedings{DBLP:conf/ijcai/LiCLYH21,
author = {Yang Li and Tong Chen and Yadan Luo and Hongzhi Yin and Zi Huang},
title = {Discovering Collaborative Signals for Next {POI} Recommendation with Iterative Seq2Graph Augmentation},
booktitle = {Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, {IJCAI} 2021},
pages = {1491--1497},
year = {2021},
doi = {10.24963/IJCAI.2021/206},
url = {https://doi.org/10.24963/ijcai.2021/206}
}

@article{li2025mapqa,
author = {Zekun Li and Malcolm Grossman and Mihir Kulkarni and Muhao Chen and Yao-Yi Chiang and others},
title = {MapQA: Open-Domain Geospatial Question Answering on Map Data},
journal = {arXiv preprint arXiv:2503.07871},
eprint = {2503.07871},
archivePrefix = {arXiv},
year = {2025}
}

@article{ma2023evolution,
author = {Dongling Ma and Baoze Liu and Qingji Huang and Qian Zhang},
title = {Evolution Characteristics and Causes---An Analysis of Urban Catering Cluster Spatial Structure},
journal = {ISPRS International Journal of Geo-Information},
volume = {12},
number = {8},
pages = {302},
year = {2023}
}

@inproceedings{mai2018poireviewqa,
author = {Gengchen Mai and Krzysztof Janowicz and Cheng He and Sumang Liu and Ni Lao},
title = {POIReviewQA: A Semantically Enriched POI Retrieval and Question Answering Dataset},
booktitle = {Proceedings of the 12th Workshop on Geographic Information Retrieval},
pages = {1--2},
year = {2018}
}

@article{mateos2025systematic,
author = {Pablo Mateos and Alejandro Bellog{\'\i}n},
title = {A Systematic Literature Review of Recent Advances on Context-Aware Recommender Systems},
journal = {Artificial Intelligence Review},
volume = {58},
number = {1},
pages = {1--53},
year = {2025}
}

@article{tang2022discovering,
author = {Wen Tang and Alireza Chakeri and Hamid Krim},
title = {Discovering Urban Functional Zones from Biased and Sparse Points of Interests and Sparse Human Activities},
journal = {Expert Systems with Applications},
volume = {207},
pages = {118062},
year = {2022}
}

@article{wan2023spatio,
author = {Zhongwei Wan and Xin Liu and Benyou Wang and Jiezhong Qiu and Boyu Li and Ting Guo and Guangyong Chen and Yang Wang},
title = {Spatio-Temporal Contrastive Learning-Enhanced GNNs for Session-Based Recommendation},
journal = {ACM Transactions on Information Systems},
volume = {42},
number = {2},
pages = {1--26},
year = {2023}
}

@article{wang2024environmental,
author = {Hongcheng Wang and Linfei Li and Xin Xu},
title = {Do Environmental Regulation Policies Increase Urban Boundary Pollution? Micro Evidence from Chinese Industrial Enterprises},
journal = {Environmental Impact Assessment Review},
volume = {106},
pages = {107524},
year = {2024}
}

@article{wang2021spatio,
author = {Huandong Wang and Qiaohong Yu and Yu Liu and Depeng Jin and Yong Li},
title = {Spatio-Temporal Urban Knowledge Graph Enabled Mobility Prediction},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
volume = {5},
number = {4},
pages = {1--24},
year = {2021}
}

@article{yang2024qwen2,
author = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and others},
title = {Qwen2.5 Technical Report},
journal = {arXiv preprint arXiv:2412.15115},
eprint = {2412.15115},
archivePrefix = {arXiv},
year = {2024}
}

@inproceedings{yang2015wikiqa,
author = {Yi Yang and Wen-tau Yih and Christopher Meek},
title = {WikiQA: A Challenge Dataset for Open-Domain Question Answering},
booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
pages = {2013--2018},
year = {2015}
}

@article{yu2024survey,
author = {Jian Yu and Lucas Guo and Jiayu Zhang and Guiling Wang},
title = {A Survey on Graph Neural Network-Based Next POI Recommendation for Smart Cities},
journal = {Journal of Reliable Intelligent Environments},
volume = {10},
number = {3},
pages = {299--318},
year = {2024}
}

@book{yu2017chinese,
author = {Li Yu},
title = {Chinese City and Regional Planning Systems},
publisher = {Routledge},
year = {2017}
}

@article{yu2024bigcity,
author = {Xie Yu and Jingyuan Wang and Yifan Yang and Qian Huang and Ke Qu},
title = {BigCity: A Universal Spatiotemporal Model for Unified Trajectory and Traffic State Data Analysis},
journal = {arXiv preprint arXiv:2412.00953},
eprint = {2412.00953},
archivePrefix = {arXiv},
year = {2024}
}
@@ -1,532 +0,0 @@
\title{A Dataset for Spatiotemporal-Sensitive\\POI Question Answering}

\begin{document}
\maketitle

\begin{abstract}
Spatiotemporal relationships are critical in data science, as many prediction and reasoning tasks require analysis across both spatial and temporal dimensions; for instance, navigating an unfamiliar city involves planning itineraries that sequence locations and time cultural experiences.
However, existing Question-Answering (QA) datasets lack sufficient spatiotemporal-sensitive questions, making them inadequate benchmarks for evaluating models' spatiotemporal reasoning capabilities.
To address this gap, we introduce \name, a novel spatiotemporal-sensitive QA dataset centered on Points of Interest (POIs), constructed through three key steps: mining and aligning open-source vehicle trajectory data from GAIA with high-precision geographic POI data, rigorously validating noisy spatiotemporal facts by hand, and generating bilingual (Chinese/English) QA pairs that reflect human-understandable spatiotemporal reasoning tasks.
Our dataset challenges models to parse complex spatiotemporal dependencies, and evaluations of state-of-the-art multilingual LLMs (\emph{e.g.,} Qwen2.5-7B, Llama3.1-8B) reveal stark limitations: even the top-performing model (Qwen2.5-7B fine-tuned with RAG+LoRA) achieves a top-10 Hit Ratio (HR@10) of only 0.41 on the easiest task, far below human performance at 0.56.
This underscores persistent weaknesses in LLMs' ability to perform consistent spatiotemporal reasoning, while highlighting \name\ as a robust benchmark for advancing algorithms sensitive to spatiotemporal dynamics. The dataset is publicly available at \datalink.
\end{abstract}
\section{Introduction}

Spatiotemporal reasoning plays a pivotal role in a wide range of prediction and decision-making tasks that require sensitivity to both spatial and temporal contexts.
This capability depends heavily on spatiotemporal information, which encompasses spatial data, such as geographic locations, and temporal data, such as time of day or sequential time-based patterns.
As a result, spatiotemporal reasoning has become an essential focus in recent research across domains including mobility analysis, personalized recommendation, and spatiotemporal prediction tasks~\cite{wan2023spatio,deng2023spatio,wang2021spatio}.
The integration of spatiotemporal reasoning into decision-making processes is not confined to technological applications but is also deeply embedded in the daily routines and choices of individuals~\cite{mateos2025systematic}.
For instance, when planning a journey, travelers often consider factors such as the geographical proximity of restaurants offering local specialties and the time required to reach these establishments.
This example underscores how both spatial and temporal elements are crucial for making informed decisions.
Among the domains where spatiotemporal reasoning is essential, Point of Interest (POI) recommendation stands out as a representative and challenging example. To effectively identify appropriate POIs, models must possess robust spatiotemporal reasoning capabilities. These capabilities enable models to analyze historical user behavior patterns, predict future preferences, and recommend POIs that align with users' interests while accounting for constraints like time and location.
In essence, the ability to reason about space and time is fundamental for developing intelligent recommendation systems that cater to diverse user needs and preferences~\cite{yu2024survey}.

In this paper, we focus on addressing the spatiotemporal challenges of POI prediction with precision and rigor.
We formally define POI prediction at travel destinations as spatiotemporal questions based on the following four criteria:
\textbf{i) Spatiotemporal Presence:} the question contains both a timestamp, [time], and a geolocation, [place], such as ``Tuesday evening'' and ``221B Baker Street'';
\textbf{ii) Spatiotemporal Context Sensitivity:} answers to similar questions may vary depending on differences in time and/or location, \ie altering the [time] or [place] can result in different answers;
\textbf{iii) Spatiotemporal Knowledge Reasoning:} such questions require broad POI data coverage and the ability to perform spatiotemporal reasoning;
\textbf{iv) Human-Readable Answer:} the answer should align with effective human-computer interaction principles, such as providing the POI name along with a specific address rather than raw latitude and longitude coordinates.
We found that, despite their ubiquity, spatiotemporal-sensitive questions are under-studied in existing POI QA datasets.
For example, SubjQA~\cite{bjerva2020subjqa} focuses on attribute-oriented questions derived from POI reviews, requiring only semantic knowledge and lacking spatial or temporal information. MapQA~\cite{li2025mapqa} supports geographic queries but omits any temporal context. TourismQA~\cite{contractor2021answering}, although built from tourism reviews and containing questions related to time or place, does not require spatiotemporal reasoning.
None of these datasets considers the spatiotemporal sensitivity specified in criterion ii).
One of the datasets closest to ours is Foursquare\footnote{https://opensource.foursquare.com/os-places/}, which provides a large amount of POI location information worldwide, along with a large number of timestamped user check-in records.
However, question samples extracted from this dataset fail to meet criteria ii), iii), and iv).
Furthermore, the spatiotemporal information in the Foursquare dataset is relatively sparse and fragmented, as many users check in at different POIs on the platform with gaps of several days.
Therefore, we construct our own dataset, called \name. We first identify spatiotemporal-evolving relationships from both GAIA trajectory data\footnote{https://outreach.didichuxing.com/} and POI information around those real-time trajectories.
Then, a large number of human annotators are employed to annotate the POIs surrounding each GPS point in every trajectory, with particular attention to double-checking the POIs near pick-up and drop-off locations.
Finally, we create bilingual datasets (in Simplified Chinese and English) at multiple levels of granularity, corresponding to different levels of question difficulty: POI name, POI subcategory, POI medium category, and POI major category. Each level contains over 5,000,000 question-answer pairs, covering about 400,000 distinct POI locations and 30 consecutive days of vehicle trajectory data.
Using POI names as labels in QA pairs is the most challenging setting, as it requires more spatiotemporal reasoning and natural language understanding than the classification tasks.
Figure~\ref{fig:illustration} shows two trajectories and their corresponding QA examples from the \name\ dataset, constructed using both trajectory facts and synthesized contextual information. Although both vehicles depart at similar times on Tuesday, the spatial variation in their departure points leads to different routes and destination contexts. This example highlights the strong spatiotemporal sensitivity of our dataset, where even slight spatial shifts under similar temporal conditions can significantly change the question context, requiring models to perform spatiotemporal reasoning.
The challenges posed by our dataset are threefold:
\begin{itemize}[leftmargin=*]
\item \textbf{Geographic Knowledge Processing}: This involves accurately identifying and categorizing POIs based on their geographic locations. For example, recognizing that a ``McDonald's'' in a bustling city center may have different operating hours compared to one in a quieter suburban area.

\item \textbf{Temporal Information Understanding}: This requires the system to understand how temporal factors affect POI availability or relevance. For instance, recognizing that a restaurant may be open for dinner on weekdays but closed on weekends.

\item \textbf{Spatiotemporal Reasoning}: This involves combining both geographic and temporal information to provide accurate predictions. For example, recognizing that a user asking about the best places to eat near their home at 8pm is likely looking for a restaurant that is still open and close to home.
\end{itemize}

\begin{figure}
\centering
\includegraphics[width=0.95\linewidth]{figs/illustration.png}
\caption{A toy example of spatiotemporal-sensitive questions.}
\label{fig:illustration}
\end{figure}
We evaluate the performance of different state-of-the-art open-source Large Language Models (LLMs) on \name\ across all levels of granularity and observe that the average HR@10 drops from 0.39 on the coarse-grained ``POI Major Category'' task to 0.21 on the fine-grained ``POI Subcategory'' task, indicating that current models struggle with spatiotemporal understanding and reasoning.
In contrast, human performance on the POI Subcategory task reaches an HR@10 of 0.57, highlighting a substantial gap between existing advanced models and human capabilities.
Therefore, we believe \name\ can serve as a valuable benchmark for studying this problem.

\section{\name\ Dataset}

In this section, we describe the pipeline used to construct our dataset, \name.
It consists of three steps:
i) geographic annotation of POIs,
ii) trajectory-based POI mapping, and
iii) spatiotemporal question-answer pair generation.

\subsection{Geographic Annotation of POIs}

Before POI annotation, the choice of POI locations is critical \cite{tang2022discovering,DBLP:conf/ijcai/LiCLYH21}: in sparsely populated areas, POIs tend to be distributed sparsely as well, and the resulting datasets are usually of low quality.
On a global scale, Chinese cities are characterized by high population density and thriving regional economic activity \cite{ma2023evolution,yu2017chinese}.
These characteristics lead to a large number and a rich variety of POIs, making such cities particularly suitable for POI annotation.
Therefore, we chose Chengdu, a Chinese city with a population in the tens of millions \cite{dong2022spatiotemporal}, as the location for the dataset.

Although POIs in a city may evolve over time through store openings, relocations, or closures, these dynamic changes are simplified in our constructed dataset to ensure consistency with the time frame of the GAIA data.
To align with this requirement, we first collected 418,854 POI entries from map engines as of the end of 2016.
After rigorous screening, we retained 418,579 POIs that remained stable over the period and excluded 275 POIs that had undergone changes.
The POI annotation process followed four core steps:

\textbf{Data Collection via Map Search Engines}:
We crawled POI data from two major map search engines in mainland China: Baidu Maps\footnote{https://lbsyun.baidu.com/} and Amap\footnote{https://lbs.amap.com/}. To ensure comprehensive coverage, we partitioned Chengdu into a grid of $500 \times 500$ cells, each approximately 300 meters on a side. For each cell, we queried the search engines for POIs near its center point (sketched below).
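A minimal sketch of this grid construction follows, assuming an illustrative bounding box for Chengdu; the \texttt{query\_nearby\_pois} call is a hypothetical stand-in for the map engines' nearby-search APIs:
\begin{verbatim}
# Sketch of the crawling grid. The bounding box is illustrative, not the
# exact extent used for the dataset.
N = 500                             # 500 x 500 cells, ~300 m per side
LAT_MIN, LAT_MAX = 30.40, 30.95     # assumed bounds for Chengdu
LON_MIN, LON_MAX = 103.75, 104.35

lat_step = (LAT_MAX - LAT_MIN) / N
lon_step = (LON_MAX - LON_MIN) / N

def grid_centers():
    """Yield the (lat, lon) center of every grid cell."""
    for i in range(N):
        for j in range(N):
            yield (LAT_MIN + (i + 0.5) * lat_step,
                   LON_MIN + (j + 0.5) * lon_step)

# Each center is then used as the query point of a nearby-POI search:
# for lat, lon in grid_centers():
#     pois.extend(query_nearby_pois(lat, lon, radius_m=300))  # hypothetical
\end{verbatim}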
\textbf{Data Cleaning and Standardization}:
Duplicate entries from the search engine results were removed. Subsequently, we standardized the geographic coordinates of each POI to the WGS84 coordinate system to ensure uniformity \cite{wang2024environmental}.

\textbf{Coordinate Validation and Error Thresholds}:
We calculated the coordinate discrepancy for the same POI across the two platforms. POIs with a coordinate difference below $10^{-4}$ degrees were retained and recorded. For discrepancies between $10^{-4}$ and $10^{-3}$, a manual review process was conducted to verify and retain valid POIs. POIs with errors exceeding $10^{-3}$ were excluded due to potential inaccuracies. A sketch of this check follows.
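The sketch below applies the thresholds stated above; how the per-axis differences are aggregated is not specified, so taking the maximum is our assumption:
\begin{verbatim}
# Sketch of the cross-platform coordinate check. Thresholds follow the
# text; aggregating by the maximum per-axis difference is an assumption.
def validate_poi(baidu_coord, amap_coord):
    """Classify a POI pair by its coordinate discrepancy (in degrees)."""
    diff = max(abs(baidu_coord[0] - amap_coord[0]),
               abs(baidu_coord[1] - amap_coord[1]))
    if diff < 1e-4:
        return "retain"          # consistent across platforms
    if diff <= 1e-3:
        return "manual_review"   # borderline, checked by annotators
    return "discard"             # likely inaccurate

print(validate_poi((30.6598, 104.0633), (30.6599, 104.0634)))  # retain
\end{verbatim}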
\textbf{Hierarchical Categorization}: To describe each POI more clearly, we manually annotated all the collected POIs.
Each POI carries three category labels: major category, medium category, and subcategory.
The entire POI dataset is divided into 19 major categories, 122 medium categories, and 959 subcategories.
For more details, please refer to Appendix \ref{app:Dataset}.

This systematic approach ensured the reliability and temporal consistency of the POI dataset in alignment with the GAIA data's requirements.
\subsection{Trajectory-based POI Mapping}

The POI mapping takes three steps: mining spatiotemporal-evolving travel targets from GAIA data, aligning geographic information with POIs, and human verification.

\textbf{Mining Spatiotemporal-evolving Travel Targets from GAIA Data}: We first utilize existing vehicle location records from the GAIA data to identify trajectories with distinctive spatiotemporal migration patterns.
Subsequently, we employ these data to mine trips that exhibit temporal and spatial evolution.
For instance, the vehicle ID ``6c8a8d17e6bbe4cd2fcdb4991b52725e'' in the GAIA data produces various trip patterns: some travel directly from entertainment venues via main roads to nearby residential areas during weekday evenings, while others divert from community gates to nearby educational institutions on holiday mornings.
These behaviors reflect clear spatiotemporal orientations, such as individuals returning home after nightlife activities or students attending weekend cram schools.
By screening and filtering vehicle trajectory records with discernible objectives, we extracted over 6 million trajectories characterized by prominent spatiotemporal migration patterns.
These trajectories are formatted as: ``carID, timestamp and location at the pick-up point, the positioning sequence during the trip, and the drop-off location.''

\textbf{Aligning Geographic Information with POIs}:
For predicting points of interest at travel destinations, it is essential to map POIs along trajectories, focusing in particular on those near the start and end points.
This mapping also serves a privacy goal, since raw order details and GPS sequences contain private information that must not be exposed.
Our objectives during data processing are therefore twofold: anonymization and POI association.
The process involves four key steps:
i) downsampling the trajectory by retaining positioning information at critical intersections and congestion points while eliminating redundancies;
ii) matching all POIs within a 100-meter radius of the start and end points, listed from nearest to farthest;
iii) using the closest POI for intermediate positioning points to obscure exact paths;
and iv) simplifying timestamps to day of the week and hour.
Each track record is then formatted as: ``anonymous carID, timestamp, POIs near pick-up location, POIs during trip, POIs near drop-off location.'' This method ensures privacy while maintaining data utility for effective destination prediction; steps ii) and iv) are sketched below.
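A minimal sketch of steps ii) and iv), assuming haversine distance and illustrative field names (the actual pipeline is not specified at this level of detail):
\begin{verbatim}
# Sketch of POI matching within 100 m (step ii) and timestamp
# simplification (step iv). Field names and the choice of haversine
# distance are our assumptions.
import math
from datetime import datetime

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    r = 6_371_000.0
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearby_pois(lat, lon, pois, radius_m=100.0):
    """All POIs within radius_m of (lat, lon), nearest first."""
    hits = [(haversine_m(lat, lon, p["lat"], p["lon"]), p) for p in pois]
    return [p for d, p in sorted(hits, key=lambda x: x[0]) if d <= radius_m]

def simplify_time(dt: datetime):
    """Reduce a full timestamp to (day of week, hour)."""
    return dt.strftime("%A"), dt.hour

print(simplify_time(datetime(2016, 11, 1, 19, 42)))  # ('Tuesday', 19)
\end{verbatim}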
\textbf{Human Verification}:
The automated programs in the prior step generate noisy data in batches. The primary sources of error are:
i) anomalies and drift in trajectory points within the GAIA data;
and ii) start or end points situated in city suburbs with low POI coverage, leading to unclear descriptions of trajectory endpoints.
To address these issues, we employ hired workers for manual verification.
This process involves the following measures:
i) Display the start (end) point of the trajectory alongside nearby POIs (from nearest to farthest, shaded from dark to light) on a single map. Identify and mark records with missing or problematic information, correcting POI details where a manual match is possible.
ii) Visualize downsampled trajectories directly within the road network. Identify and mark trajectories with obvious anomalies or discontinuities, rectifying waypoints as needed.
iii) Assign each trajectory record to at least five different workers for evaluation. If a record is flagged by more than 60\% of evaluators, it is either deleted or adjusted according to the majority opinion.
This systematic manual verification ensures data accuracy and reliability.
\subsection{Spatiotemporal Question-Answer Pair Generation}

Once we have the precise trajectory-POI matching records, the next step is to generate question-answer pairs that exhibit spatiotemporal correlation.

\begin{table}[ht]
\centering
\footnotesize
\caption{The dataset statistics.}
\label{tab:difficulty}
\begin{tabular}{cccc}
\toprule
\multicolumn{1}{c}{\bfseries{Type}} & \multicolumn{1}{c}{\bfseries{Difficulty}} & \multicolumn{1}{c}{\makecell{\bfseries{Label }\\ \bfseries{Categories}}} & \multicolumn{1}{c}{\bfseries{Specifier}}\\
\midrule
Major Category Classification & Easy & 19 & \makecell{POIs at travel destination are: \\$[$ \re{Lifestyle Services}, \re{Shopping Service}, ...$]$} \\
&&&\\
Medium Category Classification & Medium & 122 & \makecell{POIs at travel destination are: \\$[$\re{Beauty Salon}, \re{Supermarket}, ...$]$} \\
&&&\\
Subcategory Classification & Hard & 959 & \makecell{POIs at travel destination are: \\$[$\re{Plastic Surgery | Healthcare Services}, \\\re{Hui Kang Supermarket}, \\\re{Wanning Supermarket}, ...$]$} \\
&&&\\
POI Name Generation & Very Hard & 400K+
& \makecell{POIs at travel destination are: \\$[$\re{tai shi xing cai yi xue mei rong}\\\re{(No. 75 Fuqiang Street)}, \re{Wanning}\\ \re{(cheng du fu li guang chang)}, ...$]$} \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Main QA Dataset}:
Our dataset consists of two components. The first contains POI information, describing the locations and spatial relationships of the various POIs. The second is our main dataset, specifically designed for predicting POIs at travel destinations. Both components are generated using templates. Since the data originates from China, we provide both Simplified Chinese and English versions to support multilingual model training.

The synthesizing procedure is described in Figure~\ref{fig:QA_sample_synthesizing}.
As shown in Figure~\ref{fig:QA_sample_synthesizing}, we use `<>' to represent the POI name.
Since the English translation of most POI names carries no specific meaning, we use the three phrases in `()' to represent the major category, medium category, and subcategory of the POI.
To keep the data close to everyday usage and easy to understand directly, we describe the geographical location of each POI both with a natural-language address and with longitude-latitude coordinates.
Finally, for each POI, we list the nearby POIs and their distances to the current POI as an array ordered from nearest to farthest.
For each POI prediction sample, we take the POI information near the starting point and the waypoints of the vehicle trajectory as the question, and the POIs near the end of the trajectory as the predicted label.
The predicted label is a list, denoted by `[]'.
Each record in the list is a POI, including the POI name and its three corresponding categories.
We can therefore use this dataset for two families of tasks, classification and generation, as shown in Table~\ref{tab:difficulty}.
For the classification tasks, the goal is a model that can determine the category (major category, medium category, or subcategory) of the POIs near the destination; for the generation task, the goal is a model that can directly output the names of the POIs near the destination.
The difficulty of these four tasks increases progressively, and their statistics are summarized in Table~\ref{tab:difficulty}.
The license information of the dataset is listed in Appendix~\ref{app:accessibility}.
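For concreteness, the following sketch shows how one verified record could be turned into a QA pair; the template wording and the record fields are illustrative only, not the exact templates used to build \name:
\begin{verbatim}
# Sketch of template-based QA synthesis from one verified trajectory
# record. Template wording and field names are illustrative only.
def make_qa(record, level="major"):
    question = (
        f"A vehicle departs near {', '.join(record['pickup_pois'])} on "
        f"{record['weekday']} at {record['hour']}:00 and passes "
        f"{', '.join(record['via_pois'])}. "
        "What POIs are at its travel destination?"
    )
    label = [poi[level] for poi in record["dropoff_pois"]]
    return question, label

record = {
    "weekday": "Tuesday",
    "hour": 19,
    "pickup_pois": ["<Hui Kang Supermarket>"],
    "via_pois": ["<No. 9 Middle School>"],
    "dropoff_pois": [{"major": "Lifestyle Services",
                      "medium": "Beauty Salon",
                      "sub": "Plastic Surgery | Healthcare Services"}],
}
print(make_qa(record, level="major"))
\end{verbatim}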
\begin{figure}
\centering
\includegraphics[width=0.85\linewidth]{figs/QA_sample_synthesizing.png}
\caption{QA sample synthesizing.}
\label{fig:QA_sample_synthesizing}
\end{figure}

\textbf{Quality Control}:
To obtain a high-quality dataset, we applied detailed quality control during collection. In the annotation interface, we highlight the annotated POIs and timestamps with special fonts to help annotators identify them. We assign each sample to multiple workers simultaneously and let them score the data quality independently of one another. If more than 60\% of the scores for a sample are negative, the sample is removed; this filter is sketched below. In the final verification step, about 20\% of the records were modified, and we ultimately obtained 5,417,335 high-quality data samples.
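A minimal sketch of the filter, assuming each worker gives a binary valid/invalid score (the actual scoring scale is not specified):
\begin{verbatim}
# Sketch of the quality-control filter: a sample is kept only when at
# most 60% of its independent worker scores are negative. Binary scores
# are our assumption.
def keep_sample(scores):
    """scores: list of booleans, True meaning the worker judged it valid."""
    negative = sum(1 for ok in scores if not ok)
    return negative / len(scores) <= 0.6

print(keep_sample([True, True, False, True, False]))  # 40% negative -> True
\end{verbatim}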
\section{Models}

In this section, we first present the formal problem definition for POI prediction at travel destinations.
We then introduce the models used to evaluate the proposed dataset.

\subsection{Learning Problem}

Here we formally define the problem setup.
The model is given a set of POI information $D_{poi} = \{ poi_1, \cdots, poi_N \}$ and questions $Q = \{ q_1, \cdots, q_M \}$, where each POI description $poi_i, i \in [N]$ and question $q_j, j \in [M]$ is a textual sequence of fewer than 8,000 tokens.
The model must possess the following capabilities:
i) Semantic Understanding: accurately interpret user queries to identify intent and relevant context;
ii) Information Retrieval: efficiently search through $D_{poi}$ to extract pertinent POI data based on query requirements;
iii) Spatiotemporal Analysis: incorporate location- and time-based constraints to effectively filter and rank candidate POIs;
iv) Human-Computer Interaction: generate responses that are not only accurate but also presented in a user-friendly manner, ensuring clarity and relevance.
The model's objective is to generate a response string $\hat{A}$ that accurately answers the query by leveraging these capabilities. This involves selecting the most appropriate POI(s) from $D_{poi}$ based on the query's context and constraints, while maintaining a balance between precision and user experience.
The approach integrates natural language processing techniques with spatiotemporal reasoning to achieve robust performance across diverse scenarios.

\subsection{Pre-trained LLMs with SFT and RAG}

To address these challenges, especially the four capabilities mentioned above, we adopt two open-source LLMs as base models: Llama3.1~\cite{grattafiori2024llama} and Qwen2.5~\cite{yang2024qwen2}, which are known to achieve state-of-the-art performance on a wide range of open-world QA tasks (\eg Natural Questions~\cite{kwiatkowski2019natural}, TriviaQA~\cite{joshi2017triviaqa}, and WikiQA~\cite{yang2015wikiqa}).

Llama3.1 and Qwen2.5 are both built on transformer-based decoder architectures with support for a 128K context length.
Llama3.1 introduces Grouped Query Attention and follows a post-training pipeline consisting of supervised fine-tuning (SFT), reward modeling, and direct preference optimization (DPO), while Qwen2.5 adopts a two-stage pretraining strategy with RoPE adjusted base frequency (ABF) technology and enhanced Chinese language support.
Appendix~\ref{app:basemodel} provides a detailed description of their design and training processes.

Beyond evaluating model performance on \name\ in a zero-shot setting, we also employ Low-Rank Adaptation (LoRA) fine-tuning \cite{hu2022lora} and Retrieval-Augmented Generation (RAG) \cite{lewis2020retrieval} for further assessment. More details are provided in Appendices~\ref{app:LoRA} and \ref{app:RAG}.
\section{Experiments}

In this section, we conduct several baseline experiments to better characterize the proposed dataset.

\subsection{Experimental Setup}
Experiments are conducted using the two state-of-the-art LLMs mentioned above as base models: Llama3.1-8B and Qwen2.5-7B.
For the Llama model we use the English version of the dataset, while for the Qwen model we use the Chinese version, as these pairings generate the best results.
The two versions of the dataset are identical in content except for the language.
Additionally, we employ one specialized model, DeepSeek-R1-32B, for fine-grained task decomposition, retrieval-result summarization, and final generation in the RAG pipeline, as detailed in the Models section and Appendix~\ref{app:RAG}.
We evaluate multiple model variants to analyze the impact of different methods on spatiotemporal reasoning capabilities: zero-shot, LoRA-based fine-tuning, retrieval-augmented generation (RAG), and a combined RAG+LoRA method.

We fine-tune all models with a bf16 mixed-precision training strategy, using the AdamW optimizer with a learning rate of 1e-4 and a cosine scheduler.
For LoRA-based methods, the rank is set to 16. Models are fine-tuned for 3 epochs with a batch size of 24 per GPU.
The best model is selected based on validation performance; the validation set comprises 10\% of the total dataset.
All training is conducted on NVIDIA A100 GPUs with 80GB memory running Ubuntu 22.04. A sketch of this configuration follows.
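The sketch below expresses this setup with the Hugging Face \texttt{peft} and \texttt{transformers} libraries; only the hyperparameters stated above come from our setup, while \texttt{target\_modules}, \texttt{lora\_alpha}, and \texttt{lora\_dropout} are assumed values:
\begin{verbatim}
# Sketch of the fine-tuning configuration. Values marked "assumed" are
# illustrative choices, not settings reported in the text.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # LoRA rank from the text
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="poi-qa-lora",
    bf16=True,                         # mixed-precision strategy
    optim="adamw_torch",               # AdamW optimizer
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    per_device_train_batch_size=24,
)
\end{verbatim}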
\subsection{Evaluation Metrics}

We evaluate model performance on four answer types: POI name, subcategory, medium category, and major category, covering spatiotemporal reasoning at multiple granularities.
We designed two evaluation settings that differ in how the answer space is defined: \textbf{QA for Classification Tasks} and \textbf{Open-world Generative QA}.
For both settings, we report Hit Ratio (HR@$k$) and Normalized Discounted Cumulative Gain (NDCG@$k$) at $k\!\in\!\{5,10,20\}$.
For the generative setting, we additionally compute BLEU-based textual-similarity scores to assess lexical quality.
Detailed metric definitions are provided in Appendix~\ref{app:metrics}; reference implementations of the two ranking metrics are sketched below.
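A minimal reference implementation of HR@$k$ and NDCG@$k$ with binary relevance; the exact matching and normalization used in our evaluation may differ in detail:
\begin{verbatim}
# Sketch of the two ranking metrics with binary relevance.
import math

def hit_ratio_at_k(ranked, truth, k):
    """1.0 if any ground-truth item appears in the top-k predictions."""
    return float(any(item in truth for item in ranked[:k]))

def ndcg_at_k(ranked, truth, k):
    """Binary-relevance NDCG@k: DCG of the top-k list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in truth)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(truth), k)))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["Supermarket", "Beauty Salon", "Cinema"]
truth = {"Cinema", "Supermarket"}
print(hit_ratio_at_k(ranked, truth, 2), ndcg_at_k(ranked, truth, 3))
\end{verbatim}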
\subsection{Main Results}
\label{exp:main_results}

Tables \ref{tab:classification_hr}--\ref{tab:generation_results} summarize the primary results across model variants and metrics for the classification tasks and the open-world generative QA task, respectively.
Each table reports the performance of the base LLMs, Qwen2.5-7B and Llama3.1-8B, under four experimental configurations: zero-shot, LoRA-based fine-tuning, RAG, and combined RAG+LoRA.

\paragraph{QA for Classification tasks.}
As shown in Tables~\ref{tab:classification_hr} and \ref{tab:classification_ndcg}, zero-shot performance is consistently low, confirming that spatiotemporal reasoning remains challenging for out-of-the-box LLMs.
LoRA and RAG each enhance model performance.
Taking $k=10$ as an example, LoRA improves HR@10 by 0.05 and 0.09 on average for Llama and Qwen, respectively, whereas RAG, through the integration of external spatiotemporal knowledge, achieves slightly larger gains of 0.06 and 0.13.
When combined, RAG+LoRA obtains the best results, outperforming the zero-shot baseline by factors of 2.5 and 3.9 on HR@10 and NDCG@10, respectively.
\begin{table}[ht]
\centering
\caption{Results for classification tasks. We report HR@\{5,10,20\} for each model variant.}
\label{tab:classification_hr}
\small
\resizebox{1.\linewidth}{!}{
\begin{tabular}{l|ccc|ccc|ccc}
\toprule
\multirow{2}{*}{\textbf{Model}}
& \multicolumn{3}{c|}{\textbf{Major Category}}
& \multicolumn{3}{c|}{\textbf{Medium Category}}
& \multicolumn{3}{c}{\textbf{Subcategory}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7} \cmidrule(lr){8-10}
& \textbf{HR@5} & \textbf{HR@10} & \textbf{HR@20}
& \textbf{HR@5} & \textbf{HR@10} & \textbf{HR@20}
& \textbf{HR@5} & \textbf{HR@10} & \textbf{HR@20}\\
\midrule
Llama3.1-8B (zero-shot)
& 0.0664 & 0.1001 & 0.0917
& 0.0281 & 0.0481 & 0.0695
& 0.0222 & 0.0350 & 0.0372 \\
Qwen2.5-7B (zero-shot)
& 0.1017 & 0.1775 & 0.1650
& 0.0451 & 0.0784 & 0.0814
& 0.0263 & 0.0467 & 0.0673 \\
\midrule
Llama3.1-8B (LoRA)
& 0.1239 & 0.1880 & 0.2067
& 0.0590 & 0.1041 & 0.1241
& 0.0445 & 0.0687 & 0.0797 \\
Qwen2.5-7B (LoRA)
& 0.1950 & 0.3222 & 0.3509
& 0.1004 & 0.1627 & 0.1871
& 0.0611 & 0.1062 & 0.1250 \\
\midrule
Llama3.1-8B (RAG)
& 0.1237 & 0.1770 & 0.2089
& 0.0593 & 0.1155 & 0.1328
& 0.0461 & 0.0721 & 0.0848 \\
Qwen2.5-7B (RAG)
& 0.2099 & \underline{0.3821} & 0.3815
& 0.0967 & 0.1876 & 0.2008
& 0.0650 & 0.1107 & 0.1218 \\
\midrule
Llama3.1-8B (RAG+LoRA)
& \underline{0.2189} & 0.3784 & \underline{0.4356}
& \underline{0.1736} & \underline{0.2966} & \underline{0.3379}
& \underline{0.1092} & \underline{0.2009} & \underline{0.2324} \\
Qwen2.5-7B (RAG+LoRA)
& \textbf{0.2339} & \textbf{0.4062} & \textbf{0.4698}
& \textbf{0.1812} & \textbf{0.2987} & \textbf{0.3577}
& \textbf{0.1288} & \textbf{0.2185} & \textbf{0.2586} \\
\bottomrule
\end{tabular}
}

\small{
Bold and underlined values indicate statistically significant improvement (two-sided $t$-test, $p<0.05$) over the best baseline.
}
\end{table}
\begin{table}[ht]
\centering
\caption{Results for classification tasks. We report NDCG@\{5,10,20\} for each model variant.}
\label{tab:classification_ndcg}
\small
\resizebox{1.\linewidth}{!}{
\begin{tabular}{l|ccc|ccc|ccc}
\toprule
\multirow{2}{*}{\textbf{Model}}
& \multicolumn{3}{c|}{\textbf{Major Category}}
& \multicolumn{3}{c|}{\textbf{Medium Category}}
& \multicolumn{3}{c}{\textbf{Subcategory}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7} \cmidrule(lr){8-10}
& \textbf{NDCG@5} & \textbf{NDCG@10} & \textbf{NDCG@20}
& \textbf{NDCG@5} & \textbf{NDCG@10} & \textbf{NDCG@20}
& \textbf{NDCG@5} & \textbf{NDCG@10} & \textbf{NDCG@20}\\
\midrule
Llama3.1-8B (zero-shot)
& 0.1073 & 0.1841 & 0.2150
& 0.0617 & 0.1241 & 0.1380
& 0.0631 & 0.0842 & 0.1141 \\
Qwen2.5-7B (zero-shot)
& 0.1778 & 0.3130 & 0.3521
& 0.1047 & 0.1736 & 0.2369
& 0.0910 & 0.1319 & 0.1642 \\
\midrule
Llama3.1-8B (LoRA)
& 0.2085 & 0.3448 & 0.3948
& 0.1284 & 0.2268 & 0.2646
& 0.1182 & 0.1959 & 0.2247 \\
Qwen2.5-7B (LoRA)
& 0.3555 & 0.5694 & 0.6976
& 0.1968 & 0.3479 & 0.4270
& 0.1898 & 0.2804 & 0.3241 \\
\midrule
Llama3.1-8B (RAG)
& 0.2436 & 0.3911 & 0.4029
& 0.1319 & 0.2530 & 0.2857
& 0.1304 & 0.2075 & 0.2245 \\
Qwen2.5-7B (RAG)
& 0.3550 & 0.6315 & 0.6790
& 0.2121 & 0.3655 & 0.4646
& 0.1879 & 0.2808 & 0.3250 \\
\midrule
Llama3.1-8B (RAG+LoRA)
& \underline{0.4722} & \underline{0.6940} & \underline{0.7363}
& \underline{0.3512} & \underline{0.6464} & \underline{0.7485}
& \underline{0.3512} & \underline{0.5729} & \underline{0.6595} \\
Qwen2.5-7B (RAG+LoRA)
& \textbf{0.4615} & \textbf{0.7179} & \textbf{0.8307}
& \textbf{0.3699} & \textbf{0.6388} & \textbf{0.7118}
& \textbf{0.3143} & \textbf{0.5767} & \textbf{0.6822} \\
\bottomrule
\end{tabular}
}

\small{
Bold and underlined values indicate statistically significant improvement (two-sided $t$-test, $p<0.05$) over the best baseline.
}
\end{table}
\begin{table}[ht]
\centering
\caption{Open-world Generative QA results.
Besides HR@\{5,10,20\} and NDCG@\{5,10,20\}, we include a BLEU-based textual-similarity score (the ``BLEUScore'' column) to measure lexical similarity.}
\label{tab:generation_results}
\small
\resizebox{1.\linewidth}{!}{
\begin{tabular}{l|ccc|ccc|c}
\toprule
\multirow{2}{*}{\textbf{Model}}
& \multicolumn{3}{c|}{\textbf{Hit Ratio (Full Match)}}
& \multicolumn{3}{c|}{\textbf{NDCG (Full Match)}}
& \multirow{2}{*}{\textbf{BLEUScore}}
\\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
& \textbf{HR@5} & \textbf{HR@10} & \textbf{HR@20}
& \textbf{NDCG@5} & \textbf{NDCG@10} & \textbf{NDCG@20} \\
\midrule
Llama3.1-8B (zero-shot)
& 0.0075 & 0.0112 & 0.0146
& 0.0149 & 0.0244 & 0.0297
& 0.0332 \\
Qwen2.5-7B (zero-shot)
& 0.0119 & 0.0199 & 0.0234
& 0.0213 & 0.0390 & 0.0442
& 0.0254 \\
\midrule
Llama3.1-8B (LoRA)
& 0.0144 & 0.0241 & 0.0282
& 0.0320 & 0.0512 & 0.0589
& 0.2941 \\
Qwen2.5-7B (LoRA)
& 0.0220 & 0.0394 & 0.0459
& 0.0464 & 0.0798 & 0.0940
& 0.3082 \\
\midrule
Llama3.1-8B (RAG)
& 0.0142 & 0.0232 & 0.0294
& 0.0338 & 0.0537 & 0.0640
& 0.4125 \\
Qwen2.5-7B (RAG)
& 0.0226 & 0.0441 & 0.0496
& 0.0484 & 0.0850 & 0.1048
& 0.5321 \\
\midrule
Llama3.1-8B (RAG+LoRA)
& \underline{0.0331} & \underline{0.0584} & \underline{0.0690}
& \underline{0.0725} & \underline{0.1276} & \textbf{0.1509}
& \underline{0.7729} \\
Qwen2.5-7B (RAG+LoRA)
& \textbf{0.0394} & \textbf{0.0616} & \textbf{0.0714}
& \textbf{0.0770} & \textbf{0.1289} & \underline{0.1508}
& \textbf{0.7911} \\
\bottomrule
\end{tabular}
}

\small{
Bold and underlined values indicate statistically significant improvement (two-sided $t$-test, $p<0.05$) over the best baseline.
}
\end{table}
\paragraph{Open-world Generative QA.}
This task poses a greater challenge, as models must not only reason over complex spatiotemporal constraints but also generate accurately formatted POI names.
Taking $k=10$ as an instance, in the zero-shot setting HR@10 drops to 0.0112 for Llama and 0.0199 for Qwen, and even the best-performing configuration, RAG combined with LoRA, achieves only 0.06 for HR@10 and 0.1283 for NDCG@10 on average.

Despite the difficulty, both LoRA and RAG contribute positively.
LoRA increases HR@10 by almost 100\%, RAG provides an additional improvement of about 110\%, and their combination yields a total gain of roughly 6$\times$ over the zero-shot setting.
While the strict ranking metrics remain relatively low, the BLEUScore stays relatively high when RAG and LoRA are combined, indicating that the generated outputs are often semantically similar to the label even when they do not match it exactly.
This finding highlights the necessity of controlling hallucination and ensuring accurate outputs in generative spatiotemporal QA tasks.
The differentiated results also indicate that the proposed dataset calls for modeling methods that capture spatiotemporal relationships more precisely.
\begin{table}[ht]
\centering
\caption{Performance on the human-paraphrased subset of \name.}
\label{tab:human_results}
\small
\resizebox{1.\linewidth}{!}{
\begin{tabular}{l|ccc|ccc|c}
\toprule
\multicolumn{1}{l}{\multirow{2}{*}{\textbf{Task}}}
& \multicolumn{3}{c}{\textbf{Hit Ratio}}
& \multicolumn{3}{c}{\textbf{NDCG}}
& \multirow{2}{*}{\textbf{BLEUScore}}
\\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
& \textbf{HR@5} & \textbf{HR@10} & \textbf{HR@20}
& \textbf{NDCG@5} & \textbf{NDCG@10} & \textbf{NDCG@20} \\
\midrule
Classification: Major Category
& 0.3493 & 0.5644 & 0.6701
& 0.6518 & 0.7774 & 0.8432
& - \\
Classification: Medium Category
& 0.2891 & 0.4150 & 0.4693
& 0.5119 & 0.6875 & 0.7861
& - \\
Classification: Subcategory
& 0.1833 & 0.3035 & 0.3481
& 0.4411 & 0.6012 & 0.7140
& - \\
\midrule
Generation:\quad\ POI Names
& 0.1548 & 0.1611 & 0.1984
& 0.2096 & 0.2667 & 0.2924
& 0.8655 \\
\bottomrule
\end{tabular}
}
\end{table}
\subsection{Human-Paraphrased Results}
\label{exp:human_para}

To assess how well the models generalize to natural user queries, we asked crowd-workers to paraphrase $N_{\text{para}}{=}1{,}000$ questions from \name's test data.
Table~\ref{tab:human_results} reports the results of the best-performing configuration, RAG+LoRA, on this paraphrased subset.
Across the two base LLMs, the performance drop from templated to paraphrased questions is substantial: roughly 70\% on HR and 85\% on NDCG, on average.
\section{Related Work}

\subsection{POI-related QA}
In recent years, many works have addressed POI-related tasks, particularly with the rise of location-based services.
Early datasets often involved retrieving factual data from structured knowledge bases or user-generated content.
For instance, POIReviewQA~\cite{mai2018poireviewqa} was proposed to support open-domain search and QA using Yelp reviews.
Tourism reviews have also been used to build POI-recommendation questions~\cite{contractor2021answering}.
More recently, MapQA~\cite{li2025mapqa} focuses on open-domain QA over geospatial entities and relationships, using geospatial data as the reference.

While these datasets advance POI-related QA by leveraging user reviews and geospatial data, they primarily focus on knowledge extraction from static information or on direct user-preference modeling, rather than on systematically evaluating a model's spatiotemporal reasoning capabilities. We therefore hope our dataset can serve as a complement to existing POI-related QA research.

\subsection{Spatiotemporal Reasoning}
Spatiotemporal reasoning, which involves understanding and making inferences over the combined dimensions of space and time, is crucial for many AI applications. In NLP and QA, several efforts have targeted temporal reasoning.
For example, recent datasets such as TempQuestions~\cite{jia2018tempquestions} and ComplexTempQA~\cite{gruber2024complextempqa} specifically focus on temporal question answering, with the latter tackling complex queries requiring across-time comparison and multi-hop temporal reasoning. On the spatial side, datasets such as MapQA~\cite{li2025mapqa} evaluate geospatial reasoning by using map data directly.

However, most of these datasets focus primarily on either the temporal or the spatial aspect. \name~aims to fill this gap by providing QA that explicitly captures spatiotemporal dependencies in the context of POI trajectories.

\subsection{Spatiotemporal Foundation LLMs}

LLMs have strong capabilities in general question answering, but there is still considerable room for improvement in spatiotemporal reasoning within dynamic real-world scenarios.
Recently, research has increasingly focused on specialized adaptations to improve LLMs' spatiotemporal understanding and reasoning.
For instance, CityGPT~\cite{feng2024citygpt} aims to empower the urban spatial cognition of LLMs by fine-tuning them with a specially constructed instruction dataset, CityInstruction, to introduce urban knowledge and enhance spatial reasoning for city-scale tasks. BIGCity~\cite{yu2024bigcity} proposes a universal spatiotemporal model for unified analysis of diverse spatiotemporal data types.

Besides, benchmarks such as STBench~\cite{li2024stbench} assess LLMs on a range of spatiotemporal tasks, including knowledge comprehension, spatiotemporal reasoning, accurate computation, and downstream applications.
Our \name~emphasizes spatiotemporal-sensitive questions for evaluating models' spatiotemporal reasoning.
\section{Conclusion}

In this paper, we explored the importance of spatiotemporal reasoning in real-world tasks.
We highlighted the limitations of existing QA datasets with respect to spatiotemporal-sensitive questions and introduced a novel dataset, \name, to address these challenges.
The dataset incorporates privacy-preserving real-world trajectory data and extensive human annotations, providing a comprehensive resource for evaluating spatiotemporal reasoning capabilities.

Our analysis revealed significant performance drops of state-of-the-art models on fine-grained POI prediction tasks, underscoring the need for improved spatiotemporal understanding. With its unique features, including bilingual support and multiple levels of granularity, \name\ serves as a valuable benchmark for advancing research on intelligent recommendation systems. We believe it will play a pivotal role in developing more accurate and context-aware solutions for real-world applications.