AI Architecture Patterns for Transport Data Systems
Agentic RAG, progressive discovery, and knowledge graphs
Research document for the Global Intelligence System for Transport (GIST)
Focus: Powering the "Transport Angel" AI chatbot
Last updated: 2026-02-09
1. Overview
The Transport Angel is an AI chatbot that can query across heterogeneous transport datasets, answer questions about global transport systems, provide real-time information, and support decision-making. This document covers the architecture patterns, techniques, and considerations for building such a system.
2. Core Architecture: Retrieval Augmented Generation (RAG)
2.1 Why RAG for Transport Data
Large Language Models (LLMs) have broad world knowledge but:
- Cannot access real-time data (training data has a cutoff)
- Cannot query structured databases directly
- May hallucinate facts about specific routes, schedules, or statistics
- Have limited context windows relative to the volume of transport data
RAG solves this by retrieving relevant data from external sources and injecting it into the LLM's context before generating a response.
2.2 RAG Architecture for Transport
User Query
|
v
[Query Understanding] --> Classify intent, extract entities
| (route name, stop name, city,
| mode, time, etc.)
v
[Query Routing] --> Determine which data sources to query
| (schedule DB, real-time feed,
| geospatial DB, knowledge base, etc.)
v
[Data Retrieval] --> Execute queries against relevant sources
| (SQL, API calls, vector search,
| geospatial queries)
v
[Context Assembly] --> Format retrieved data for the LLM
| (tables, summaries, maps, charts)
v
[LLM Generation] --> Generate natural language response
| with citations to sources
v
[Response Enrichment] --> Add map visualizations,
links, structured data cards
2.3 RAG Variants Applicable to Transport
Naive RAG: Simple vector similarity search to find relevant documents, stuff them into context. Not suitable for structured transport data.
Advanced RAG (with query planning):
- Query decomposition: Break complex questions into sub-queries
- Iterative retrieval: Use initial results to refine subsequent queries
- Multi-source fusion: Combine results from multiple data sources
Graph RAG: Use a knowledge graph to traverse relationships between transport entities (stops connected to routes, routes operated by agencies, agencies serving cities). Generate context by following graph edges from entities mentioned in the query.
Structured RAG (Recommended for GIST):
- Convert natural language queries into structured queries (SQL, API calls)
- Execute against structured databases
- Format results for LLM consumption
- LLM generates natural language explanation of structured results
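The Structured RAG flow above can be sketched as a minimal pipeline. The two `llm_*` functions below are placeholders for real LLM calls, and the table/column names are illustrative:

```python
# Minimal Structured RAG sketch: NL -> SQL -> results -> NL explanation.
# llm_generate_sql and the final formatting stand in for real LLM calls.

def llm_generate_sql(question: str, schema: str) -> str:
    """Placeholder: an LLM would translate the question into SQL here,
    given the schema description in the prompt."""
    return "SELECT route_short_name FROM routes LIMIT 5"

def execute_sql(sql: str) -> list:
    """Placeholder: run against the GTFS database (read-only)."""
    return [{"route_short_name": "M41"}]

def format_for_llm(rows: list) -> str:
    """Render structured results as a compact text table for the prompt."""
    if not rows:
        return "(no rows)"
    header = " | ".join(rows[0].keys())
    body = "\n".join(" | ".join(str(v) for v in r.values()) for r in rows)
    return f"{header}\n{body}"

def answer(question: str, schema: str) -> str:
    sql = llm_generate_sql(question, schema)
    rows = execute_sql(sql)
    context = format_for_llm(rows)
    # Placeholder for the final LLM generation step, which would explain
    # the structured results in natural language with citations.
    return f"Based on the data:\n{context}"
```

The key property is that the LLM never touches the database directly: it only produces a query and explains results that were actually retrieved.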
2.4 Vector Embeddings for Transport Data
What to embed:
- Stop/station descriptions and metadata
- Route descriptions and characteristics
- Agency/operator information
- Transport policy documents
- News articles about transport
- FAQ/help content about transport systems
What NOT to embed (use structured queries instead):
- Timetable data (better served by SQL queries against GTFS tables)
- Real-time vehicle positions (better served by geospatial queries)
- Fare calculations (better served by fare engine APIs)
- Statistical data (better served by analytical queries)
Embedding models: Modern embedding models (e.g., OpenAI text-embedding-3, Cohere embed, open-source models like BGE, E5, GTE) work well for transport entity descriptions. Consider multilingual embedding models for a global system.
Vector databases: pgvector (PostgreSQL extension), Qdrant, Weaviate, Pinecone, Milvus. For GIST, pgvector is recommended (keeps vectors in the same database as spatial data in PostGIS, simplifying the architecture).
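With pgvector, the embedding store is plain SQL. A minimal sketch, held as SQL strings (the table, column names, and 1536-dimension choice are assumptions to be matched to the chosen embedding model):

```python
# Illustrative pgvector DDL and query for embedded transport entities.
# `<=>` is pgvector's cosine distance operator; the HNSW index speeds up
# approximate nearest-neighbour search.

CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE entity_embeddings (
    entity_id   text PRIMARY KEY,
    entity_type text NOT NULL,   -- 'stop', 'route', 'agency', 'document'
    description text NOT NULL,
    embedding   vector(1536)     -- dimension must match the embedding model
);
CREATE INDEX ON entity_embeddings
    USING hnsw (embedding vector_cosine_ops);
"""

# The query vector comes from embedding the user's question with the
# same model used at indexing time.
SEARCH_SQL = """
SELECT entity_id, description,
       embedding <=> %(query_vec)s AS distance
FROM entity_embeddings
WHERE entity_type = %(entity_type)s
ORDER BY embedding <=> %(query_vec)s
LIMIT 10;
"""
```

Because the vectors live next to the PostGIS tables, a single query can combine semantic similarity with a spatial filter.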
3. Natural Language to Query (NL2SQL/NL2Query)
3.1 The Challenge
Users will ask questions like:
- "What buses run from Central Station to the airport after 9 PM?"
- "How many bike-share stations are within 500 meters of metro stops in Paris?"
- "Show me the busiest ferry routes in Scandinavia"
- "Compare the average delay of trains in Germany vs France this month"
These require generating structured queries against diverse data sources.
3.2 NL2SQL Approaches
Direct LLM SQL generation:
- Provide the database schema to the LLM (table names, column descriptions, sample data)
- LLM generates SQL query
- Execute query and return results
- Works surprisingly well for simple to moderate queries with modern LLMs (GPT-4, Claude, etc.)
Challenges specific to transport data:
- Complex schemas (GTFS has non-obvious relationships, e.g., trips.txt -> stop_times.txt -> stops.txt)
- Spatial queries (PostGIS functions such as ST_DWithin and ST_Distance are not most LLMs' strong suit)
- Temporal queries (complex calendar logic in GTFS, timezone handling)
- Cross-database queries (combining GTFS schedule with real-time data)
- Ambiguous entity resolution ("Central Station" could be in many cities)
Mitigation strategies:
- Schema documentation: Provide detailed descriptions of each table and column, not just names. Include example queries in the prompt.
- Few-shot examples: Include 10-20 example question/SQL pairs in the prompt or fine-tuning dataset. Cover common query patterns:
  - "What routes serve stop X?" (join routes, trips, stop_times, stops)
  - "When is the next bus at stop X?" (temporal query with current time)
  - "How far is stop X from stop Y?" (spatial distance query)
  - "What stops are near location X?" (spatial proximity query)
- Query templates: Pre-define parameterized SQL templates for common question types. Use the LLM to identify the template and extract parameters, rather than generating SQL from scratch.
- Semantic layer: Build a semantic layer (e.g., using dbt metrics, Cube.js, or a custom abstraction) that provides higher-level concepts ("next departure", "route frequency", "service area") that the LLM can reference instead of raw SQL.
- Query validation: Validate generated SQL against the schema before execution. Check for common errors (wrong table names, invalid joins, missing WHERE clauses).
- Query execution sandbox: Execute queries with timeouts, row limits, and read-only access. Prevent expensive full-table scans.
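The validation step can start very simply. A minimal pre-execution check, assuming a GTFS-style table list (a production system would use a real SQL parser rather than regular expressions):

```python
import re

# Allow-list of GTFS-derived tables (illustrative).
ALLOWED_TABLES = {"agency", "routes", "trips", "stop_times", "stops", "calendar"}

def validate_sql(sql: str) -> list:
    """Cheap checks on LLM-generated SQL before it reaches the database.
    Returns a list of error strings; empty means the query may proceed
    to the read-only, row-limited execution sandbox."""
    errors = []
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        errors.append("only SELECT statements are allowed")
    # Very rough table extraction: identifiers after FROM/JOIN.
    for table in re.findall(r"\b(?:from|join)\s+(\w+)", stripped, re.I):
        if table.lower() not in ALLOWED_TABLES:
            errors.append(f"unknown table: {table}")
    if re.search(r"\blimit\b", stripped, re.I) is None:
        errors.append("missing LIMIT clause (add a row cap)")
    return errors
```

Anything that fails validation is fed back to the LLM as an error message for a retry, rather than executed.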
3.3 Multi-Database Query Orchestration
GIST will have multiple data sources. The AI agent needs to determine which source(s) to query:
| Question Type | Data Source(s) | Query Type |
|---|---|---|
| Schedule/timetable | GTFS PostgreSQL DB | SQL |
| Real-time arrivals | GTFS-RT / SIRI feed | API call / cached data |
| Vehicle positions | Real-time stream / cache | Geospatial query |
| Bike availability | GBFS feed | API call / cached data |
| Traffic conditions | DATEX II / traffic DB | API call / SQL |
| Vessel positions | AIS database | Geospatial query |
| Flight status | Aviation API | API call |
| Infrastructure | OSM / PostGIS | Spatial SQL |
| Statistics/trends | Analytical DB (DuckDB) | Analytical SQL |
| Policy/documentation | Vector store | Semantic search |
| General knowledge | LLM knowledge | Direct generation |
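As a first approximation, the routing in the table above can be a keyword lookup; a production system would use an LLM or trained classifier for intent detection. The source labels below are illustrative, not real service identifiers:

```python
# Toy router mirroring the question-type table. Each rule maps trigger
# words to a data-source label; first match wins, with an LLM fallback.

ROUTING_RULES = [
    ({"next", "departure", "timetable", "schedule"}, "gtfs_sql"),
    ({"delay", "live", "now"},                       "realtime_feed"),
    ({"bike", "scooter"},                            "gbfs_api"),
    ({"vessel", "ship", "mmsi"},                     "ais_db"),
    ({"flight"},                                     "aviation_api"),
    ({"policy", "regulation", "directive"},          "vector_store"),
]

def route_query(question: str) -> str:
    words = set(question.lower().split())
    for keywords, source in ROUTING_RULES:
        if words & keywords:
            return source
    return "llm_direct"   # fall back to the model's own knowledge
```

A real router also needs to handle questions that span several sources (e.g., schedule plus real-time), returning a set of sources rather than one.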
3.4 Text-to-SQL Tools and Frameworks
- LangChain SQL Agent: Provides SQL database tooling for LLM agents. Supports schema introspection, query generation, error recovery.
- LlamaIndex SQL tools: Similar to LangChain but with different abstractions.
- Vanna.ai: Open-source text-to-SQL framework that learns from your data. Uses RAG over your specific schema and query history.
- SQLCoder: Fine-tuned open-source models specifically for text-to-SQL (Defog.ai).
- DuckDB + MotherDuck AI: DuckDB ecosystem has natural language query interfaces.
4. LLM Agent Architecture
4.1 Agent Pattern for Transport Angel
The Transport Angel should be implemented as an LLM agent with access to multiple tools, not just a simple RAG pipeline. An agent can:
- Plan multi-step queries
- Use different tools for different sub-tasks
- Handle errors and retry with different strategies
- Maintain conversation context
4.2 Agent Tools (Functions)
Define a set of tools the agent can invoke:
Data Query Tools:
- query_schedule(origin, destination, datetime, mode) -- Find routes and schedules
- query_realtime(stop_id, route_id) -- Get real-time arrivals/predictions
- query_vehicle_positions(bbox, mode, agency) -- Get current vehicle locations
- query_bike_availability(location, radius) -- Check shared bike/scooter availability
- query_traffic(corridor, datetime) -- Get traffic conditions
- query_vessel(mmsi, name, area) -- Find vessel information and position
- query_flight(flight_number, route) -- Get flight information
Spatial Tools:
- geocode(place_name) -- Convert place name to coordinates
- reverse_geocode(lat, lon) -- Convert coordinates to place name
- calculate_distance(origin, destination) -- Calculate distance between points
- find_nearby(lat, lon, radius, category) -- Find transport facilities nearby
- calculate_isochrone(origin, time_budget, mode) -- Calculate reachable area
Analytical Tools:
- query_statistics(metric, aggregation, filters, time_range) -- Transport statistics
- compare_systems(cities, metrics) -- Compare transport systems
- trend_analysis(metric, time_range, granularity) -- Analyze trends over time
Knowledge Tools:
- search_knowledge_base(query) -- Search transport documentation and policies
- search_news(query, time_range) -- Search transport news and alerts
Visualization Tools:
- generate_map(layers, bbox, style) -- Generate a map visualization
- generate_chart(data, chart_type, options) -- Generate a chart
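To make one of these tools callable by an LLM, it is described as a function-calling schema. A sketch for query_realtime in the common "name / description / parameters" JSON shape (field names follow the widespread convention; adapt to the exact format of your provider):

```python
# query_realtime expressed as a tool schema for LLM function calling.
# The parameter names and descriptions are what the model sees, so they
# should be written as documentation, not just identifiers.

QUERY_REALTIME_TOOL = {
    "name": "query_realtime",
    "description": "Get real-time arrival predictions for a stop, "
                   "optionally filtered to a single route.",
    "parameters": {
        "type": "object",
        "properties": {
            "stop_id": {
                "type": "string",
                "description": "Canonical GTFS stop_id of the stop.",
            },
            "route_id": {
                "type": "string",
                "description": "Optional GTFS route_id filter.",
            },
        },
        "required": ["stop_id"],
    },
}
```

Clear descriptions matter more than clever prompts here: the model chooses tools and fills parameters based almost entirely on this text.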
4.3 Agent Frameworks
LangChain / LangGraph:
- Most widely adopted LLM application framework
- LangGraph provides stateful agent workflows (graph of nodes and edges)
- Supports tool calling, memory, streaming, multi-agent patterns
- Large ecosystem of integrations
- Strong candidate for GIST agent framework
LlamaIndex:
- Data-centric LLM framework
- Strong RAG capabilities, data connectors, and query engines
- RouterQueryEngine can route queries to different data sources
- SubQuestionQueryEngine decomposes complex queries
- Strong candidate for data retrieval layer
CrewAI / AutoGen / Agency Swarm:
- Multi-agent frameworks where specialized agents collaborate
- Could model Transport Angel as a team: Schedule Agent, Real-Time Agent, Analytics Agent, Geography Agent
- Each agent has specific tools and expertise
- Consider for complex multi-step reasoning tasks
Anthropic Claude Tool Use / OpenAI Function Calling:
- Native tool/function calling in modern LLMs
- Structured output generation
- Can be used directly without a framework for simpler agent patterns
- Recommended as the base capability; framework adds orchestration on top
4.4 Conversation Memory and Context
Short-term memory: Conversation history within a session. Include relevant entities extracted from previous turns (user's city, preferred mode, recent stops mentioned).
Long-term memory: User preferences, frequently asked routes, accessibility needs. Store in user profile database.
Working memory: Intermediate results from multi-step queries. Keep in agent state during a conversation turn.
Shared context: Map viewport state (what the user is currently looking at on the map), time context (current vs planning for future).
5. Knowledge Graph for Transport Data
5.1 Why a Knowledge Graph
Transport data is inherently relational:
- Stops belong to routes
- Routes are operated by agencies
- Agencies serve cities
- Cities are in countries
- Lines connect regions
- Modes serve different purposes
- Transfers connect different lines/modes
A knowledge graph captures these relationships explicitly, enabling:
- Multi-hop reasoning ("What airlines fly from airports reachable by train from Paris?")
- Entity disambiguation ("Central Station" in which city?)
- Cross-modal journey planning context
- Rich entity profiles (combining data from multiple sources about one entity)
5.2 Transport Knowledge Graph Ontology
Key entity types and relationships:
Country --contains--> Region --contains--> City --has--> TransportSystem
TransportSystem --operated_by--> Operator
TransportSystem --includes--> Network
Network --has_mode--> Mode (bus, metro, rail, ferry, bike, etc.)
Network --contains--> Line
Line --has--> Route
Route --follows--> RoutePattern
RoutePattern --visits--> StopPoint (ordered)
StopPoint --located_at--> StopPlace
StopPlace --has--> Quay/Platform
StopPlace --nearby--> StopPlace (transfers)
StopPlace --located_in--> City/Zone
VehicleJourney --on_route--> Route
VehicleJourney --scheduled_at--> DateTime
VehicleJourney --uses--> VehicleType
Fare --applies_to--> Route/Zone/Distance
5.3 Linked Data / Semantic Web Approaches
Existing transport ontologies:
- Transmodel ontology: OWL/RDF representation of the Transmodel conceptual model. Published by CEN.
- Linked GTFS: RDF vocabulary for GTFS data (http://vocab.gtfs.org/).
- schema.org transport types: BusStation, TrainStation, Airport, BusTrip, Flight, etc.
- Wikidata: Contains extensive transport infrastructure data with stable identifiers (Q-numbers). Airports, stations, railway lines, airlines, etc.
- LinkedGeoData: RDF version of OpenStreetMap data.
SPARQL endpoints:
- Wikidata Query Service (https://query.wikidata.org/) -- rich transport entity data
- EU Open Data Portal -- some linked data
- National statistics linked data endpoints
Relevance to GIST: A lightweight knowledge graph (using Neo4j, Amazon Neptune, or even PostgreSQL with ltree/recursive CTEs) could serve as the entity and relationship layer. Full semantic web / RDF infrastructure may be overkill for GIST, but leveraging existing ontologies for data modeling is valuable.
5.4 Practical Knowledge Graph Implementation
Recommended approach for GIST:
- Build a property graph (not an RDF triple store) in Neo4j or PostgreSQL:
  - Nodes: Stops, Routes, Agencies, Cities, Countries, Modes
  - Edges: serves, operates, located_in, connects_to, transfers_to
  - Properties: names (multilingual), coordinates, identifiers (GTFS IDs, NeTEx IDs, OSM IDs, Wikidata IDs)
- Populate from multiple sources:
  - GTFS feeds for transit entities
  - OSM for infrastructure
  - Wikidata for metadata and identifiers
  - Manual curation for high-level relationships
- Use for:
  - Entity resolution (mapping user queries to specific entities)
  - Context enrichment (adding related information to RAG context)
  - Navigation (finding paths through the transport network at a conceptual level)
  - Cross-referencing (linking the same entity across different data sources)
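The PostgreSQL option can be as simple as a nodes table, an edges table, and a recursive CTE for multi-hop traversal. A sketch (table, column, and relationship names are assumptions):

```python
# Property-graph-in-PostgreSQL sketch, held as SQL strings.

GRAPH_DDL = """
CREATE TABLE kg_nodes (
    id    text PRIMARY KEY,
    kind  text NOT NULL,            -- 'stop', 'route', 'agency', 'city', ...
    props jsonb NOT NULL DEFAULT '{}'
);
CREATE TABLE kg_edges (
    src  text REFERENCES kg_nodes(id),
    dst  text REFERENCES kg_nodes(id),
    rel  text NOT NULL,             -- 'serves', 'operates', 'located_in', ...
    PRIMARY KEY (src, dst, rel)
);
"""

# Everything reachable from a start node within three hops, using a
# recursive CTE with an explicit depth cap to bound traversal.
REACHABLE_SQL = """
WITH RECURSIVE reachable AS (
    SELECT dst, 1 AS depth
    FROM kg_edges
    WHERE src = %(start_id)s
  UNION
    SELECT e.dst, r.depth + 1
    FROM kg_edges e
    JOIN reachable r ON e.src = r.dst
    WHERE r.depth < 3
)
SELECT n.* FROM kg_nodes n JOIN reachable r ON n.id = r.dst;
"""
```

For deep or frequent traversals a dedicated graph database will outperform this, but for entity resolution and context enrichment a few-hop CTE in the existing PostgreSQL instance is often enough.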
6. Data Aggregation and ETL Patterns
6.1 Data Lake Architecture
Transport Data Lake layers:
[Raw Zone] [Curated Zone] [Serving Zone]
| | |
GTFS ZIPs Normalized stops PostGIS tables
NeTEx XML Normalized routes Vector tiles
SIRI XML/JSON Unified schedules API cache
GBFS JSON Trip records Search index
AIS NMEA Vehicle positions Knowledge graph
DATEX II XML Traffic events Analytics cubes
OSM PBF Network topology LLM embeddings
Raw Zone: Store original data in its native format. Preserve provenance (source URL, download timestamp, format version). Use object storage (S3, GCS, R2).
Curated Zone: Normalize data into a canonical schema. Clean, validate, deduplicate. Store in Parquet/GeoParquet format for efficient analytics.
Serving Zone: Optimized for query patterns (PostGIS for spatial queries, Elasticsearch for search, vector tiles for visualization, Redis/Memcached for real-time cache).
6.2 Data Mesh Considerations
In a Data Mesh approach, each transport domain owns its data:
- Transit domain team owns GTFS/NeTEx data
- Road domain team owns DATEX II / traffic data
- Maritime domain team owns AIS / vessel data
- Aviation domain team owns flight data
- Shared mobility domain team owns GBFS/MDS data
Each domain publishes data as a data product with:
- Clear schema and documentation
- Quality SLAs
- Self-service access APIs
- Standard metadata
Assessment: Data mesh is more relevant for large organizations. For GIST as a platform, a centralized data lake with domain-specific ingestion pipelines is more pragmatic. Use data mesh principles (data ownership, quality SLAs, self-service) without the full organizational model.
6.3 ETL Pipeline Architecture
Recommended pipeline tools:
| Component | Tool Options | Recommendation |
|---|---|---|
| Orchestration | Apache Airflow, Dagster, Prefect | Dagster (modern, data-aware, good testing) |
| Transformation | dbt, SQLMesh, custom Python | dbt for SQL transforms, Python for format conversion |
| Streaming | Apache Kafka, Apache Flink | Kafka for ingestion, Flink for complex stream processing |
| Quality | Great Expectations, Soda, dbt tests | Great Expectations + dbt tests |
| Format conversion | Custom Python, GDAL/OGR | GDAL/OGR for geospatial, custom for transport-specific |
GTFS ETL pipeline example:
- Discover: Scan Transitland, MobilityData Catalog, NAPs for new/updated GTFS feeds
- Download: Fetch GTFS ZIP files, store in raw zone
- Validate: Run MobilityData GTFS Validator (canonical validator)
- Load: Import into PostgreSQL using gtfs-via-postgres, gtfsdb, or custom loader
- Normalize: Map to canonical schema (standardize route types, agency names, stop identifiers)
- Enrich: Add geocoded city/region, link to knowledge graph entities, compute derived metrics
- Serve: Generate vector tiles, update search index, refresh API cache
Real-time pipeline example (GTFS-RT):
- Poll: Fetch GTFS-RT feeds every 15-60 seconds
- Decode: Parse protobuf messages
- Validate: Check for stale data, missing fields, outlier positions
- Publish: Write to Kafka topic (partitioned by region/agency)
- Process: Kafka consumer updates PostGIS vehicle position table, calculates delay metrics
- Serve: WebSocket server pushes updates to connected clients within relevant viewport
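The validation step in the real-time pipeline mostly consists of cheap sanity checks. A sketch of two of them, with an assumed two-minute staleness threshold:

```python
import time
from typing import Optional

MAX_FEED_AGE_S = 120   # assumed threshold: feeds older than 2 min are stale

def is_stale(feed_timestamp: int, now: Optional[float] = None) -> bool:
    """Flag a GTFS-RT feed header timestamp (POSIX seconds) as stale.
    Stale feeds should be dropped or served with a freshness warning."""
    now = time.time() if now is None else now
    return (now - feed_timestamp) > MAX_FEED_AGE_S

def plausible_position(lat: float, lon: float) -> bool:
    """Reject obviously invalid vehicle coordinates: out-of-range values
    and the (0, 0) 'null island' default that misconfigured GPS units emit."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    return not (abs(lat) < 0.01 and abs(lon) < 0.01)
```

Messages that fail these checks are counted and dropped before the Kafka publish step, so downstream consumers can trust what they read.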
7. Multi-Modal AI Capabilities
7.1 Map-Aware AI
The Transport Angel should be aware of what the user sees on the map:
- Viewport context: Know the current map extent and zoom level. Use this to scope queries geographically.
- Selected features: Know which stops, routes, or vehicles the user has clicked/selected.
- Visual query: User can draw on the map (circle an area, draw a route) and ask questions about that area.
- Map as output: AI can respond with map actions (pan to location, highlight route, show isochrone, add data layer).
7.2 Chart and Data Visualization Generation
The AI should be able to generate visualizations as part of responses:
- Time-series charts (ridership trends, delay patterns)
- Bar charts (mode share comparison, busiest routes)
- Maps with highlighted features
- Tables with formatted data
Approach: Use tool calling to invoke visualization functions. Return structured data that the frontend renders (not images generated by the AI).
7.3 Multimodal Input
Future capability: Users could upload or share:
- Photos of bus stops or station signs (OCR to identify location)
- Screenshots of timetables (extract schedule data)
- PDF transport documents (parse and index)
- Voice input (especially for mobile/accessibility)
7.4 Multilingual Support
A global transport system must support queries in many languages:
- Use multilingual LLMs (Claude, GPT-4, Gemini all support many languages)
- Multilingual embedding models for vector search
- Multilingual stop/station name matching (handle transliterations, multiple scripts)
- Language detection and response in user's preferred language
- Cross-language entity resolution ("Gare du Nord" = "North Station" = "Nordbahnhof")
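A first step toward cross-language name matching is normalization: fold case and strip diacritics before comparing. A minimal sketch (true transliteration between scripts needs a dedicated library and is not handled here):

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Canonicalize a stop/station name for matching: NFKD-decompose,
    drop combining marks (so 'Zürich' -> 'zurich'), casefold, trim."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_like = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ascii_like.casefold().strip()

def same_station(a: str, b: str) -> bool:
    """Exact match after normalization; fuzzy matching and alias tables
    (e.g., Wikidata labels) would layer on top of this."""
    return normalize_name(a) == normalize_name(b)
```

Aliases like "Gare du Nord" vs "North Station" cannot be solved by string normalization at all; that is where the knowledge graph's multilingual labels come in.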
8. Example Interaction Patterns for Transport Angel
8.1 Simple Factual Query
User: "When is the next bus from Alexanderplatz to Potsdamer Platz?"
Agent actions:
- Geocode "Alexanderplatz" and "Potsdamer Platz" (or resolve as known stops in Berlin)
- Query GTFS schedule + real-time for Berlin (BVG feed)
- Return next departures with real-time predictions
8.2 Comparative Analysis
User: "Compare the metro systems of Tokyo and London by ridership, network length, and average frequency"
Agent actions:
- Query knowledge graph for Tokyo Metro + Toei + London Underground entities
- Query statistics database for ridership, network metrics
- Query GTFS feeds for frequency calculation (if available)
- Synthesize comparison table
- Generate comparison chart
8.3 Geospatial Query
User: "What public transport options are within 15 minutes walk of the Eiffel Tower?"
Agent actions:
- Geocode "Eiffel Tower" (known landmark)
- Calculate walking isochrone (15 minutes ~= 1.2 km radius)
- Query stops within isochrone polygon
- Group by mode (metro, bus, RER, tram)
- Return results with map visualization showing isochrone and stops
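The ~1.2 km radius approximation in step 2 can be implemented with a haversine filter as a first pass; a true isochrone needs a pedestrian network (e.g., via pgRouting or ST_DWithin on PostGIS geographies). A stdlib sketch:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def stops_within(stops, lat, lon, radius_m):
    """Radius filter over stop dicts with 'lat'/'lon' keys (illustrative
    schema). Stands in for the isochrone-polygon query in the example."""
    return [s for s in stops
            if haversine_m(lat, lon, s["lat"], s["lon"]) <= radius_m]
```

The radius filter over-approximates walkability (rivers, railways, and missing crossings all shrink the true reachable area), which is why the production path uses a network-based isochrone.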
8.4 Real-Time Situational
User: "Are there any disruptions on the London Underground right now?"
Agent actions:
- Query TfL SIRI-SX / alerts feed for current disruptions
- Query GTFS-RT alerts for London
- Format disruption information by line
- Show affected lines on map
8.5 Policy/Knowledge Query
User: "What data format does the European Union require for publishing transit schedules?"
Agent actions:
- Search knowledge base for EU transport data regulations
- Retrieve information about ITS Directive, Delegated Regulation 2017/1926, NeTEx
- Generate comprehensive answer with citations
9. Technical Considerations
9.1 Latency Budget
For a conversational AI, target response times:
- Simple factual queries: < 3 seconds
- Multi-step queries: < 8 seconds (with streaming partial results)
- Complex analytical queries: < 15 seconds (with progress indicators)
Optimization strategies:
- Cache frequently queried data (popular stops, common routes)
- Pre-compute common aggregations
- Use streaming LLM responses (show text as it generates)
- Execute independent tool calls in parallel
- Use faster models for simple classification/routing, larger models for generation
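Parallel tool execution is usually the easiest latency win. A sketch with asyncio, where the two coroutines stand in for real tool implementations and the sleeps simulate I/O:

```python
import asyncio

async def query_schedule(stop_id: str) -> dict:
    await asyncio.sleep(0.05)          # simulate a database round-trip
    return {"stop_id": stop_id, "next_scheduled": "21:15"}

async def query_realtime(stop_id: str) -> dict:
    await asyncio.sleep(0.05)          # simulate an API call
    return {"stop_id": stop_id, "predicted": "21:17"}

async def gather_context(stop_id: str) -> dict:
    # Independent tool calls run concurrently; total wall time is the
    # slowest call, not the sum of both.
    scheduled, realtime = await asyncio.gather(
        query_schedule(stop_id), query_realtime(stop_id)
    )
    return {**scheduled, **realtime}

result = asyncio.run(gather_context("stop-123"))
```

The agent orchestrator decides which tool calls are independent (safe to parallelize) and which depend on earlier results; frameworks like LangGraph express this as edges in the workflow graph.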
9.2 Accuracy and Hallucination Prevention
Transport data requires high accuracy (wrong schedule = missed connection):
- Always ground answers in retrieved data, not LLM knowledge
- Include data freshness timestamps in responses
- Cite specific data sources
- Use structured output (JSON mode) for data queries, not free-form generation
- Implement confidence scoring: if data is uncertain, say so
- Validate generated SQL before execution
- Cross-check real-time data against schedule data for plausibility
9.3 Privacy and Safety
- Do not track individual user journeys unless explicitly opted in
- Aggregate real-time vehicle position data for privacy (avoid identifying specific vehicles by pattern)
- Content safety for user-generated queries
- Rate limiting to prevent abuse
- Data access controls (some transport data may have licensing restrictions)
9.4 Cost Management
LLM API costs can scale significantly:
- Use smaller/cheaper models for classification and routing
- Use larger models only for complex generation
- Cache LLM responses for identical queries
- Batch embedding generation
- Monitor and optimize token usage
- Consider open-source models (Llama, Mistral) for cost-sensitive operations
10. Reference Implementations and Inspiration
10.1 Existing AI + Transport/Geospatial Systems
- Google Maps AI features: Natural language search for places, transit directions, area exploration. Increasingly uses AI for personalized recommendations.
- Citymapper: Multimodal routing with real-time data integration. Uses ML for arrival predictions.
- Transit App: Real-time transit information with ML-enhanced predictions (crowdsourced).
- Moovit (Intel): Global transit data platform with AI-enhanced routing and predictions.
- Remix (Via): Transit planning platform with geospatial analytics.
- Mapbox AI: AI-powered navigation and mapping features.
- Azure Maps + OpenAI: Microsoft's integration of LLMs with geospatial data.
- Overture Maps + AI: Overture Maps Foundation is building open map data that could be AI-queryable.
10.2 Research Papers and Projects
- GeoLLM: Research on grounding LLMs in geographic knowledge
- SpaBERT: Spatial language understanding
- LLM4GIS: Using LLMs for geospatial information systems
- TURL (Table Understanding using Relational Learning): Relevant for understanding transport tabular data
- Text2SQL benchmarks: Spider, BIRD, WikiSQL -- for evaluating NL-to-SQL capabilities
11. Recommended Architecture Summary
+-------------------+
| Transport Angel |
| (LLM Agent) |
+--------+----------+
|
+--------v----------+
| Agent Orchestrator |
| (LangGraph/custom) |
+--------+----------+
|
+----------------+----------------+
| | |
+------v------+ +-----v------+ +------v------+
| Query Tools | | Spatial | | Knowledge |
| (NL2SQL, | | Tools | | Tools |
| API calls) | | (geocode, | | (vector |
| | | isochrone, | | search, |
| | | nearby) | | graph) |
+------+------+ +-----+------+ +------+------+
| | |
+------v------+ +-----v------+ +------v------+
| Data Sources| | PostGIS / | | Vector DB / |
| (GTFS DB, | | pgRouting | | Knowledge |
| RT cache, | | | | Graph |
| Analytics) | | | | |
+-------------+ +------------+ +-------------+
Key design principles:
- Tool-based architecture: LLM selects and invokes tools; does not access data directly
- Structured retrieval over vector search: Use SQL/API queries for structured transport data; reserve vector search for unstructured content
- Ground truth over generation: Always prefer retrieved data over LLM knowledge for factual answers
- Progressive disclosure: Start with a concise answer; offer to drill deeper
- Map-integrated: Responses can include spatial actions (show on map, highlight, navigate)
- Source attribution: Every data point includes its source and freshness
12. References
- LangChain documentation: https://python.langchain.com/
- LangGraph: https://langchain-ai.github.io/langgraph/
- LlamaIndex: https://docs.llamaindex.ai/
- Vanna.ai (text-to-SQL): https://vanna.ai/
- pgvector: https://github.com/pgvector/pgvector
- Anthropic Claude tool use: https://docs.anthropic.com/claude/docs/tool-use
- OpenAI function calling: https://platform.openai.com/docs/guides/function-calling
- Transmodel ontology: https://transmodel-cen.eu/
- Linked GTFS: http://vocab.gtfs.org/
- Wikidata Query Service: https://query.wikidata.org/
- SQLCoder: https://github.com/defog-ai/sqlcoder
- H3 for spatial indexing: https://h3geo.org/