AI Architecture Patterns for Transport Data Systems
Agentic RAG, progressive discovery, and knowledge graphs
Research document for the Global Intelligence System for Transport (GIST)
Focus: Powering the "Transport Angel" AI chatbot
Last updated: 2026-02-09
1. Overview
The Transport Angel is an AI chatbot that can query across heterogeneous transport datasets, answer questions about global transport systems, provide real-time information, and support decision-making. This document covers the architecture patterns, techniques, and considerations for building such a system.
2. Core Architecture: Retrieval Augmented Generation (RAG)
2.1 Why RAG for Transport Data
Large Language Models (LLMs) have broad world knowledge but:
- Cannot access real-time data (training data has a cutoff)
- Cannot query structured databases directly
- May hallucinate facts about specific routes, schedules, or statistics
- Have limited context windows relative to the volume of transport data
RAG solves this by retrieving relevant data from external sources and injecting it into the LLM's context before generating a response.
2.2 RAG Architecture for Transport
User Query
|
v
[Query Understanding] --> Classify intent, extract entities
| (route name, stop name, city,
| mode, time, etc.)
v
[Query Routing] --> Determine which data sources to query
| (schedule DB, real-time feed,
| geospatial DB, knowledge base, etc.)
v
[Data Retrieval] --> Execute queries against relevant sources
| (SQL, API calls, vector search,
| geospatial queries)
v
[Context Assembly] --> Format retrieved data for the LLM
| (tables, summaries, maps, charts)
v
[LLM Generation] --> Generate natural language response
| with citations to sources
v
[Response Enrichment] --> Add map visualizations,
links, structured data cards
2.3 RAG Variants Applicable to Transport
Naive RAG: Simple vector similarity search to find relevant documents, stuff them into context. Not suitable for structured transport data.
Advanced RAG (with query planning):
- Query decomposition: Break complex questions into sub-queries
- Iterative retrieval: Use initial results to refine subsequent queries
- Multi-source fusion: Combine results from multiple data sources
Graph RAG: Use a knowledge graph to traverse relationships between transport entities (stops connected to routes, routes operated by agencies, agencies serving cities). Generate context by following graph edges from entities mentioned in the query.
Structured RAG (Recommended for GIST):
- Convert natural language queries into structured queries (SQL, API calls)
- Execute against structured databases
- Format results for LLM consumption
- LLM generates natural language explanation of structured results
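The Structured RAG flow above can be sketched as a minimal pipeline. The two `llm_*` functions below are placeholders for real LLM calls, and the table/column names are illustrative:

```python
# Minimal Structured RAG sketch: NL -> SQL -> results -> NL explanation.
# llm_generate_sql and the final formatting stand in for real LLM calls.

def llm_generate_sql(question: str, schema: str) -> str:
    """Placeholder: an LLM would translate the question into SQL here,
    given the schema description in the prompt."""
    return "SELECT route_short_name FROM routes LIMIT 5"

def execute_sql(sql: str) -> list:
    """Placeholder: run against the GTFS database (read-only)."""
    return [{"route_short_name": "M41"}]

def format_for_llm(rows: list) -> str:
    """Render structured results as a compact text table for the prompt."""
    if not rows:
        return "(no rows)"
    header = " | ".join(rows[0].keys())
    body = "\n".join(" | ".join(str(v) for v in r.values()) for r in rows)
    return f"{header}\n{body}"

def answer(question: str, schema: str) -> str:
    sql = llm_generate_sql(question, schema)
    rows = execute_sql(sql)
    context = format_for_llm(rows)
    # Placeholder for the final LLM generation step, which would explain
    # the structured results in natural language with citations.
    return f"Based on the data:\n{context}"
```

The key property is that the LLM never touches the database directly: it only produces a query and explains results that were actually retrieved.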
2.4 Vector Embeddings for Transport Data
What to embed:
- Stop/station descriptions and metadata
- Route descriptions and characteristics
- Agency/operator information
- Transport policy documents
- News articles about transport
- FAQ/help content about transport systems
What NOT to embed (use structured queries instead):
- Timetable data (better served by SQL queries against GTFS tables)
- Real-time vehicle positions (better served by geospatial queries)
- Fare calculations (better served by fare engine APIs)
- Statistical data (better served by analytical queries)
Embedding models: Modern embedding models (e.g., OpenAI text-embedding-3, Cohere embed, open-source models like BGE, E5, GTE) work well for transport entity descriptions. Consider multilingual embedding models for a global system.
Vector databases: pgvector (PostgreSQL extension), Qdrant, Weaviate, Pinecone, Milvus. For GIST, pgvector is recommended (keeps vectors in the same database as spatial data in PostGIS, simplifying the architecture).
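With pgvector, the embedding store is plain SQL. A minimal sketch, held as SQL strings (the table, column names, and 1536-dimension choice are assumptions to be matched to the chosen embedding model):

```python
# Illustrative pgvector DDL and query for embedded transport entities.
# `<=>` is pgvector's cosine distance operator; the HNSW index speeds up
# approximate nearest-neighbour search.

CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE entity_embeddings (
    entity_id   text PRIMARY KEY,
    entity_type text NOT NULL,   -- 'stop', 'route', 'agency', 'document'
    description text NOT NULL,
    embedding   vector(1536)     -- dimension must match the embedding model
);
CREATE INDEX ON entity_embeddings
    USING hnsw (embedding vector_cosine_ops);
"""

# The query vector comes from embedding the user's question with the
# same model used at indexing time.
SEARCH_SQL = """
SELECT entity_id, description,
       embedding <=> %(query_vec)s AS distance
FROM entity_embeddings
WHERE entity_type = %(entity_type)s
ORDER BY embedding <=> %(query_vec)s
LIMIT 10;
"""
```

Because the vectors live next to the PostGIS tables, a single query can combine semantic similarity with a spatial filter.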
3. Natural Language to Query (NL2SQL/NL2Query)
3.1 The Challenge
Users will ask questions like:
- "What buses run from Central Station to the airport after 9 PM?"
- "How many bike-share stations are within 500 meters of metro stops in Paris?"
- "Show me the busiest ferry routes in Scandinavia"
- "Compare the average delay of trains in Germany vs France this month"
These require generating structured queries against diverse data sources.
3.2 NL2SQL Approaches
Direct LLM SQL generation:
- Provide the database schema to the LLM (table names, column descriptions, sample data)
- LLM generates SQL query
- Execute query and return results
- Works surprisingly well for simple to moderate queries with modern LLMs (GPT-4, Claude, etc.)
Challenges specific to transport data:
- Complex schemas (GTFS has non-obvious relationships, e.g., trips.txt -> stop_times.txt -> stops.txt)
- Spatial queries (PostGIS functions such as ST_DWithin and ST_Distance are not most LLMs' strong suit)
- Temporal queries (complex calendar logic in GTFS, timezone handling)
- Cross-database queries (combining GTFS schedule with real-time data)
- Ambiguous entity resolution ("Central Station" could be in many cities)
Mitigation strategies:
- Schema documentation: Provide detailed descriptions of each table and column, not just names. Include example queries in the prompt.
- Few-shot examples: Include 10-20 example question/SQL pairs in the prompt or fine-tuning dataset. Cover common query patterns:
  - "What routes serve stop X?" (join routes, trips, stop_times, stops)
  - "When is the next bus at stop X?" (temporal query with current time)
  - "How far is stop X from stop Y?" (spatial distance query)
  - "What stops are near location X?" (spatial proximity query)
- Query templates: Pre-define parameterized SQL templates for common question types. Use the LLM to identify the template and extract parameters, rather than generating SQL from scratch.
- Semantic layer: Build a semantic layer (e.g., using dbt metrics, Cube.js, or a custom abstraction) that provides higher-level concepts ("next departure", "route frequency", "service area") that the LLM can reference instead of raw SQL.
- Query validation: Validate generated SQL against the schema before execution. Check for common errors (wrong table names, invalid joins, missing WHERE clauses).
- Query execution sandbox: Execute queries with timeouts, row limits, and read-only access. Prevent expensive full-table scans.
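The validation step can start very simply. A minimal pre-execution check, assuming a GTFS-style table list (a production system would use a real SQL parser rather than regular expressions):

```python
import re

# Allow-list of GTFS-derived tables (illustrative).
ALLOWED_TABLES = {"agency", "routes", "trips", "stop_times", "stops", "calendar"}

def validate_sql(sql: str) -> list:
    """Cheap checks on LLM-generated SQL before it reaches the database.
    Returns a list of error strings; empty means the query may proceed
    to the read-only, row-limited execution sandbox."""
    errors = []
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        errors.append("only SELECT statements are allowed")
    # Very rough table extraction: identifiers after FROM/JOIN.
    for table in re.findall(r"\b(?:from|join)\s+(\w+)", stripped, re.I):
        if table.lower() not in ALLOWED_TABLES:
            errors.append(f"unknown table: {table}")
    if re.search(r"\blimit\b", stripped, re.I) is None:
        errors.append("missing LIMIT clause (add a row cap)")
    return errors
```

Anything that fails validation is fed back to the LLM as an error message for a retry, rather than executed.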
3.3 Multi-Database Query Orchestration
GIST will have multiple data sources. The AI agent needs to determine which source(s) to query:
| Question Type | Data Source(s) | Query Type |
|---|---|---|
| Schedule/timetable | GTFS PostgreSQL DB | SQL |
| Real-time arrivals | GTFS-RT / SIRI feed | API call / cached data |
| Vehicle positions | Real-time stream / cache | Geospatial query |
| Bike availability | GBFS feed | API call / cached data |
| Traffic conditions | DATEX II / traffic DB | API call / SQL |
| Vessel positions | AIS database | Geospatial query |
| Flight status | Aviation API | API call |
| Infrastructure | OSM / PostGIS | Spatial SQL |
| Statistics/trends | Analytical DB (DuckDB) | Analytical SQL |
| Policy/documentation | Vector store | Semantic search |
| General knowledge | LLM knowledge | Direct generation |
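As a first approximation, the routing in the table above can be a keyword lookup; a production system would use an LLM or trained classifier for intent detection. The source labels below are illustrative, not real service identifiers:

```python
# Toy router mirroring the question-type table. Each rule maps trigger
# words to a data-source label; first match wins, with an LLM fallback.

ROUTING_RULES = [
    ({"next", "departure", "timetable", "schedule"}, "gtfs_sql"),
    ({"delay", "live", "now"},                       "realtime_feed"),
    ({"bike", "scooter"},                            "gbfs_api"),
    ({"vessel", "ship", "mmsi"},                     "ais_db"),
    ({"flight"},                                     "aviation_api"),
    ({"policy", "regulation", "directive"},          "vector_store"),
]

def route_query(question: str) -> str:
    words = set(question.lower().split())
    for keywords, source in ROUTING_RULES:
        if words & keywords:
            return source
    return "llm_direct"   # fall back to the model's own knowledge
```

A real router also needs to handle questions that span several sources (e.g., schedule plus real-time), returning a set of sources rather than one.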
3.4 Text-to-SQL Tools and Frameworks
- LangChain SQL Agent: Provides SQL database tooling for LLM agents. Supports schema introspection, query generation, error recovery.
- LlamaIndex SQL tools: Similar to LangChain but with different abstractions.
- Vanna.ai: Open-source text-to-SQL framework that learns from your data. Uses RAG over your specific schema and query history.
- SQLCoder: Fine-tuned open-source models specifically for text-to-SQL (Defog.ai).
- DuckDB + MotherDuck AI: DuckDB ecosystem has natural language query interfaces.
4. LLM Agent Architecture
4.1 Agent Pattern for Transport Angel
The Transport Angel should be implemented as an LLM agent with access to multiple tools, not just a simple RAG pipeline. An agent can:
- Plan multi-step queries
- Use different tools for different sub-tasks
- Handle errors and retry with different strategies
- Maintain conversation context
4.2 Agent Tools (Functions)
Define a set of tools the agent can invoke:
Data Query Tools:
- query_schedule(origin, destination, datetime, mode) -- Find routes and schedules
- query_realtime(stop_id, route_id) -- Get real-time arrivals/predictions
- query_vehicle_positions(bbox, mode, agency) -- Get current vehicle locations
- query_bike_availability(location, radius) -- Check shared bike/scooter availability
- query_traffic(corridor, datetime) -- Get traffic conditions
- query_vessel(mmsi, name, area) -- Find vessel information and position
- query_flight(flight_number, route) -- Get flight information
Spatial Tools:
- geocode(place_name) -- Convert place name to coordinates
- reverse_geocode(lat, lon) -- Convert coordinates to place name
- calculate_distance(origin, destination) -- Calculate distance between points
- find_nearby(lat, lon, radius, category) -- Find transport facilities nearby
- calculate_isochrone(origin, time_budget, mode) -- Calculate reachable area
Analytical Tools:
- query_statistics(metric, aggregation, filters, time_range) -- Transport statistics
- compare_systems(cities, metrics) -- Compare transport systems
- trend_analysis(metric, time_range, granularity) -- Analyze trends over time
Knowledge Tools:
- search_knowledge_base(query) -- Search transport documentation and policies
- search_news(query, time_range) -- Search transport news and alerts
Visualization Tools:
- generate_map(layers, bbox, style) -- Generate a map visualization
- generate_chart(data, chart_type, options) -- Generate a chart
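To make one of these tools callable by an LLM, it is described as a function-calling schema. A sketch for query_realtime in the common "name / description / parameters" JSON shape (field names follow the widespread convention; adapt to the exact format of your provider):

```python
# query_realtime expressed as a tool schema for LLM function calling.
# The parameter names and descriptions are what the model sees, so they
# should be written as documentation, not just identifiers.

QUERY_REALTIME_TOOL = {
    "name": "query_realtime",
    "description": "Get real-time arrival predictions for a stop, "
                   "optionally filtered to a single route.",
    "parameters": {
        "type": "object",
        "properties": {
            "stop_id": {
                "type": "string",
                "description": "Canonical GTFS stop_id of the stop.",
            },
            "route_id": {
                "type": "string",
                "description": "Optional GTFS route_id filter.",
            },
        },
        "required": ["stop_id"],
    },
}
```

Clear descriptions matter more than clever prompts here: the model chooses tools and fills parameters based almost entirely on this text.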
4.3 Agent Frameworks
LangChain / LangGraph:
- Most widely adopted LLM application framework
- LangGraph provides stateful agent workflows (graph of nodes and edges)
- Supports tool calling, memory, streaming, multi-agent patterns
- Large ecosystem of integrations
- Strong candidate for GIST agent framework
LlamaIndex:
- Data-centric LLM framework
- Strong RAG capabilities, data connectors, and query engines
- RouterQueryEngine can route queries to different data sources
- SubQuestionQueryEngine decomposes complex queries
- Strong candidate for data retrieval layer
CrewAI / AutoGen / Agency Swarm:
- Multi-agent frameworks where specialized agents collaborate
- Could model Transport Angel as a team: Schedule Agent, Real-Time Agent, Analytics Agent, Geography Agent
- Each agent has specific tools and expertise
- Consider for complex multi-step reasoning tasks
Anthropic Claude Tool Use / OpenAI Function Calling:
- Native tool/function calling in modern LLMs
- Structured output generation
- Can be used directly without a framework for simpler agent patterns
- Recommended as the base capability; framework adds orchestration on top
4.4 Conversation Memory and Context
Short-term memory: Conversation history within a session. Include relevant entities extracted from previous turns (user's city, preferred mode, recent stops mentioned).
Long-term memory: User preferences, frequently asked routes, accessibility needs. Store in user profile database.
Working memory: Intermediate results from multi-step queries. Keep in agent state during a conversation turn.
Shared context: Map viewport state (what the user is currently looking at on the map), time context (current vs planning for future).
5. Knowledge Graph for Transport Data
5.1 Why a Knowledge Graph
Transport data is inherently relational:
- Stops belong to routes
- Routes are operated by agencies
- Agencies serve cities
- Cities are in countries
- Lines connect regions
- Modes serve different purposes
- Transfers connect different lines/modes
A knowledge graph captures these relationships explicitly, enabling:
- Multi-hop reasoning ("What airlines fly from airports reachable by train from Paris?")
- Entity disambiguation ("Central Station" in which city?)
- Cross-modal journey planning context
- Rich entity profiles (combining data from multiple sources about one entity)
5.2 Transport Knowledge Graph Ontology
Key entity types and relationships:
Country --contains--> Region --contains--> City --has--> TransportSystem
TransportSystem --operated_by--> Operator
TransportSystem --includes--> Network
Network --has_mode--> Mode (bus, metro, rail, ferry, bike, etc.)
Network --contains--> Line
Line --has--> Route
Route --follows--> RoutePattern
RoutePattern --visits--> StopPoint (ordered)
StopPoint --located_at--> StopPlace
StopPlace --has--> Quay/Platform
StopPlace --nearby--> StopPlace (transfers)
StopPlace --located_in--> City/Zone
VehicleJourney --on_route--> Route
VehicleJourney --scheduled_at--> DateTime
VehicleJourney --uses--> VehicleType
Fare --applies_to--> Route/Zone/Distance
5.3 Linked Data / Semantic Web Approaches
Existing transport ontologies:
- Transmodel ontology: OWL/RDF representation of the Transmodel conceptual model. Published by CEN.
- Linked GTFS: RDF vocabulary for GTFS data (http://vocab.gtfs.org/).
- schema.org transport types: BusStation, TrainStation, Airport, BusTrip, Flight, etc.
- Wikidata: Contains extensive transport infrastructure data with stable identifiers (Q-numbers). Airports, stations, railway lines, airlines, etc.
- LinkedGeoData: RDF version of OpenStreetMap data.
SPARQL endpoints:
- Wikidata Query Service (https://query.wikidata.org/) -- rich transport entity data
- EU Open Data Portal -- some linked data
- National statistics linked data endpoints
Relevance to GIST: A lightweight knowledge graph (using Neo4j, Amazon Neptune, or even PostgreSQL with ltree/recursive CTEs) could serve as the entity and relationship layer. Full semantic web / RDF infrastructure may be overkill for GIST, but leveraging existing ontologies for data modeling is valuable.
5.4 Practical Knowledge Graph Implementation
Recommended approach for GIST:
- Build a property graph (not an RDF triple store) in Neo4j or PostgreSQL:
  - Nodes: Stops, Routes, Agencies, Cities, Countries, Modes
  - Edges: serves, operates, located_in, connects_to, transfers_to
  - Properties: names (multilingual), coordinates, identifiers (GTFS IDs, NeTEx IDs, OSM IDs, Wikidata IDs)
- Populate from multiple sources:
  - GTFS feeds for transit entities
  - OSM for infrastructure
  - Wikidata for metadata and identifiers
  - Manual curation for high-level relationships
- Use for:
  - Entity resolution (mapping user queries to specific entities)
  - Context enrichment (adding related information to RAG context)
  - Navigation (finding paths through the transport network at a conceptual level)
  - Cross-referencing (linking the same entity across different data sources)
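The PostgreSQL option can be as simple as a nodes table, an edges table, and a recursive CTE for multi-hop traversal. A sketch (table, column, and relationship names are assumptions):

```python
# Property-graph-in-PostgreSQL sketch, held as SQL strings.

GRAPH_DDL = """
CREATE TABLE kg_nodes (
    id    text PRIMARY KEY,
    kind  text NOT NULL,            -- 'stop', 'route', 'agency', 'city', ...
    props jsonb NOT NULL DEFAULT '{}'
);
CREATE TABLE kg_edges (
    src  text REFERENCES kg_nodes(id),
    dst  text REFERENCES kg_nodes(id),
    rel  text NOT NULL,             -- 'serves', 'operates', 'located_in', ...
    PRIMARY KEY (src, dst, rel)
);
"""

# Everything reachable from a start node within three hops, using a
# recursive CTE with an explicit depth cap to bound traversal.
REACHABLE_SQL = """
WITH RECURSIVE reachable AS (
    SELECT dst, 1 AS depth
    FROM kg_edges
    WHERE src = %(start_id)s
  UNION
    SELECT e.dst, r.depth + 1
    FROM kg_edges e
    JOIN reachable r ON e.src = r.dst
    WHERE r.depth < 3
)
SELECT n.* FROM kg_nodes n JOIN reachable r ON n.id = r.dst;
"""
```

For deep or frequent traversals a dedicated graph database will outperform this, but for entity resolution and context enrichment a few-hop CTE in the existing PostgreSQL instance is often enough.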
6. Data Aggregation and ETL Patterns
6.1 Data Lake Architecture
Transport Data Lake layers:
[Raw Zone] [Curated Zone] [Serving Zone]
| | |
GTFS ZIPs Normalized stops PostGIS tables
NeTEx XML Normalized routes Vector tiles
SIRI XML/JSON Unified schedules API cache
GBFS JSON Trip records Search index
AIS NMEA Vehicle positions Knowledge graph
DATEX II XML Traffic events Analytics cubes
OSM PBF Network topology LLM embeddings
Raw Zone: Store original data in its native format. Preserve provenance (source URL, download timestamp, format version). Use object storage (S3, GCS, R2).
Curated Zone: Normalize data into a canonical schema. Clean, validate, deduplicate. Store in Parquet/GeoParquet format for efficient analytics.
Serving Zone: Optimized for query patterns (PostGIS for spatial queries, Elasticsearch for search, vector tiles for visualization, Redis/Memcached for real-time cache).
6.2 Data Mesh Considerations
In a Data Mesh approach, each transport domain owns its data:
- Transit domain team owns GTFS/NeTEx data
- Road domain team owns DATEX II / traffic data
- Maritime domain team owns AIS / vessel data
- Aviation domain team owns flight data
- Shared mobility domain team owns GBFS/MDS data
Each domain publishes data as a data product with:
- Clear schema and documentation
- Quality SLAs
- Self-service access APIs
- Standard metadata
Assessment: Data mesh is more relevant for large organizations. For GIST as a platform, a centralized data lake with domain-specific ingestion pipelines is more pragmatic. Use data mesh principles (data ownership, quality SLAs, self-service) without the full organizational model.
6.3 ETL Pipeline Architecture
Recommended pipeline tools:
| Component | Tool Options | Recommendation |
|---|---|---|
| Orchestration | Apache Airflow, Dagster, Prefect | Dagster (modern, data-aware, good testing) |
| Transformation | dbt, SQLMesh, custom Python | dbt for SQL transforms, Python for format conversion |
| Streaming | Apache Kafka, Apache Flink | Kafka for ingestion, Flink for complex stream processing |
| Quality | Great Expectations, Soda, dbt tests | Great Expectations + dbt tests |
| Format conversion | Custom Python, GDAL/OGR | GDAL/OGR for geospatial, custom for transport-specific |
GTFS ETL pipeline example:
- Discover: Scan Transitland, MobilityData Catalog, NAPs for new/updated GTFS feeds
- Download: Fetch GTFS ZIP files, store in raw zone
- Validate: Run MobilityData GTFS Validator (canonical validator)
- Load: Import into PostgreSQL using gtfs-via-postgres, gtfsdb, or custom loader
- Normalize: Map to canonical schema (standardize route types, agency names, stop identifiers)
- Enrich: Add geocoded city/region, link to knowledge graph entities, compute derived metrics
- Serve: Generate vector tiles, update search index, refresh API cache
Real-time pipeline example (GTFS-RT):
- Poll: Fetch GTFS-RT feeds every 15-60 seconds
- Decode: Parse protobuf messages
- Validate: Check for stale data, missing fields, outlier positions
- Publish: Write to Kafka topic (partitioned by region/agency)
- Process: Kafka consumer updates PostGIS vehicle position table, calculates delay metrics
- Serve: WebSocket server pushes updates to connected clients within relevant viewport
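The validation step in the real-time pipeline mostly consists of cheap sanity checks. A sketch of two of them, with an assumed two-minute staleness threshold:

```python
import time
from typing import Optional

MAX_FEED_AGE_S = 120   # assumed threshold: feeds older than 2 min are stale

def is_stale(feed_timestamp: int, now: Optional[float] = None) -> bool:
    """Flag a GTFS-RT feed header timestamp (POSIX seconds) as stale.
    Stale feeds should be dropped or served with a freshness warning."""
    now = time.time() if now is None else now
    return (now - feed_timestamp) > MAX_FEED_AGE_S

def plausible_position(lat: float, lon: float) -> bool:
    """Reject obviously invalid vehicle coordinates: out-of-range values
    and the (0, 0) 'null island' default that misconfigured GPS units emit."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    return not (abs(lat) < 0.01 and abs(lon) < 0.01)
```

Messages that fail these checks are counted and dropped before the Kafka publish step, so downstream consumers can trust what they read.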
7. Multi-Modal AI Capabilities
7.1 Map-Aware AI
The Transport Angel should be aware of what the user sees on the map:
- Viewport context: Know the current map extent and zoom level. Use this to scope queries geographically.
- Selected features: Know which stops, routes, or vehicles the user has clicked/selected.
- Visual query: User can draw on the map (circle an area, draw a route) and ask questions about that area.
- Map as output: AI can respond with map actions (pan to location, highlight route, show isochrone, add data layer).
7.2 Chart and Data Visualization Generation
The AI should be able to generate visualizations as part of responses:
- Time-series charts (ridership trends, delay patterns)
- Bar charts (mode share comparison, busiest routes)
- Maps with highlighted features
- Tables with formatted data
Approach: Use tool calling to invoke visualization functions. Return structured data that the frontend renders (not images generated by the AI).
7.3 Multimodal Input
Future capability: Users could upload or share:
- Photos of bus stops or station signs (OCR to identify location)
- Screenshots of timetables (extract schedule data)
- PDF transport documents (parse and index)
- Voice input (especially for mobile/accessibility)
7.4 Multilingual Support
A global transport system must support queries in many languages:
- Use multilingual LLMs (Claude, GPT-4, Gemini all support many languages)
- Multilingual embedding models for vector search
- Multilingual stop/station name matching (handle transliterations, multiple scripts)
- Language detection and response in user's preferred language
- Cross-language entity resolution ("Gare du Nord" = "North Station" = "Nordbahnhof")
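A first step toward cross-language name matching is normalization: fold case and strip diacritics before comparing. A minimal sketch (true transliteration between scripts needs a dedicated library and is not handled here):

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Canonicalize a stop/station name for matching: NFKD-decompose,
    drop combining marks (so 'Zürich' -> 'zurich'), casefold, trim."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_like = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ascii_like.casefold().strip()

def same_station(a: str, b: str) -> bool:
    """Exact match after normalization; fuzzy matching and alias tables
    (e.g., Wikidata labels) would layer on top of this."""
    return normalize_name(a) == normalize_name(b)
```

Aliases like "Gare du Nord" vs "North Station" cannot be solved by string normalization at all; that is where the knowledge graph's multilingual labels come in.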
8. Example Interaction Patterns for Transport Angel
8.1 Simple Factual Query
User: "When is the next bus from Alexanderplatz to Potsdamer Platz?"
Agent actions:
- Geocode "Alexanderplatz" and "Potsdamer Platz" (or resolve as known stops in Berlin)
- Query GTFS schedule + real-time for Berlin (BVG feed)
- Return next departures with real-time predictions
8.2 Comparative Analysis
User: "Compare the metro systems of Tokyo and London by ridership, network length, and average frequency"
Agent actions:
- Query knowledge graph for Tokyo Metro + Toei + London Underground entities
- Query statistics database for ridership, network metrics
- Query GTFS feeds for frequency calculation (if available)
- Synthesize comparison table
- Generate comparison chart
8.3 Geospatial Query
User: "What public transport options are within 15 minutes walk of the Eiffel Tower?"
Agent actions:
- Geocode "Eiffel Tower" (known landmark)
- Calculate walking isochrone (15 minutes ~= 1.2 km radius)
- Query stops within isochrone polygon
- Group by mode (metro, bus, RER, tram)
- Return results with map visualization showing isochrone and stops
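The ~1.2 km radius approximation in step 2 can be implemented with a haversine filter as a first pass; a true isochrone needs a pedestrian network (e.g., via pgRouting or ST_DWithin on PostGIS geographies). A stdlib sketch:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def stops_within(stops, lat, lon, radius_m):
    """Radius filter over stop dicts with 'lat'/'lon' keys (illustrative
    schema). Stands in for the isochrone-polygon query in the example."""
    return [s for s in stops
            if haversine_m(lat, lon, s["lat"], s["lon"]) <= radius_m]
```

The radius filter over-approximates walkability (rivers, railways, and missing crossings all shrink the true reachable area), which is why the production path uses a network-based isochrone.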
8.4 Real-Time Situational
User: "Are there any disruptions on the London Underground right now?"
Agent actions:
- Query TfL SIRI-SX / alerts feed for current disruptions
- Query GTFS-RT alerts for London
- Format disruption information by line
- Show affected lines on map
8.5 Policy/Knowledge Query
User: "What data format does the European Union require for publishing transit schedules?"
Agent actions:
- Search knowledge base for EU transport data regulations
- Retrieve information about ITS Directive, Delegated Regulation 2017/1926, NeTEx
- Generate comprehensive answer with citations
9. Technical Considerations
9.1 Latency Budget
For a conversational AI, target response times:
- Simple factual queries: < 3 seconds
- Multi-step queries: < 8 seconds (with streaming partial results)
- Complex analytical queries: < 15 seconds (with progress indicators)
Optimization strategies:
- Cache frequently queried data (popular stops, common routes)
- Pre-compute common aggregations
- Use streaming LLM responses (show text as it generates)
- Execute independent tool calls in parallel
- Use faster models for simple classification/routing, larger models for generation
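Parallel tool execution is usually the easiest latency win. A sketch with asyncio, where the two coroutines stand in for real tool implementations and the sleeps simulate I/O:

```python
import asyncio

async def query_schedule(stop_id: str) -> dict:
    await asyncio.sleep(0.05)          # simulate a database round-trip
    return {"stop_id": stop_id, "next_scheduled": "21:15"}

async def query_realtime(stop_id: str) -> dict:
    await asyncio.sleep(0.05)          # simulate an API call
    return {"stop_id": stop_id, "predicted": "21:17"}

async def gather_context(stop_id: str) -> dict:
    # Independent tool calls run concurrently; total wall time is the
    # slowest call, not the sum of both.
    scheduled, realtime = await asyncio.gather(
        query_schedule(stop_id), query_realtime(stop_id)
    )
    return {**scheduled, **realtime}

result = asyncio.run(gather_context("stop-123"))
```

The agent orchestrator decides which tool calls are independent (safe to parallelize) and which depend on earlier results; frameworks like LangGraph express this as edges in the workflow graph.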
9.2 Accuracy and Hallucination Prevention
Transport data requires high accuracy (wrong schedule = missed connection):
- Always ground answers in retrieved data, not LLM knowledge
- Include data freshness timestamps in responses
- Cite specific data sources
- Use structured output (JSON mode) for data queries, not free-form generation
- Implement confidence scoring: if data is uncertain, say so
- Validate generated SQL before execution
- Cross-check real-time data against schedule data for plausibility
9.3 Privacy and Safety
- Do not track individual user journeys unless explicitly opted in
- Aggregate real-time vehicle position data for privacy (avoid identifying specific vehicles by pattern)
- Content safety for user-generated queries
- Rate limiting to prevent abuse
- Data access controls (some transport data may have licensing restrictions)
9.4 Cost Management
LLM API costs can scale significantly:
- Use smaller/cheaper models for classification and routing
- Use larger models only for complex generation
- Cache LLM responses for identical queries
- Batch embedding generation
- Monitor and optimize token usage
- Consider open-source models (Llama, Mistral) for cost-sensitive operations
10. Reference Implementations and Inspiration
10.1 Existing AI + Transport/Geospatial Systems
- Google Maps AI features: Natural language search for places, transit directions, area exploration. Increasingly uses AI for personalized recommendations.
- Citymapper: Multimodal routing with real-time data integration. Uses ML for arrival predictions.
- Transit App: Real-time transit information with ML-enhanced predictions (crowdsourced).
- Moovit (Intel): Global transit data platform with AI-enhanced routing and predictions.
- Remix (Via): Transit planning platform with geospatial analytics.
- Mapbox AI: AI-powered navigation and mapping features.
- Azure Maps + OpenAI: Microsoft's integration of LLMs with geospatial data.
- Overture Maps + AI: Overture Maps Foundation is building open map data that could be AI-queryable.
10.2 Research Papers and Projects
- GeoLLM: Research on grounding LLMs in geographic knowledge
- SpaBERT: Spatial language understanding
- LLM4GIS: Using LLMs for geospatial information systems
- TURL (Table Understanding using Relational Learning): Relevant for understanding transport tabular data
- Text2SQL benchmarks: Spider, BIRD, WikiSQL -- for evaluating NL-to-SQL capabilities
11. Recommended Architecture Summary
+-------------------+
| Transport Angel |
| (LLM Agent) |
+--------+----------+
|
+--------v----------+
| Agent Orchestrator |
| (LangGraph/custom) |
+--------+----------+
|
+----------------+----------------+
| | |
+------v------+ +-----v------+ +------v------+
| Query Tools | | Spatial | | Knowledge |
| (NL2SQL, | | Tools | | Tools |
| API calls) | | (geocode, | | (vector |
| | | isochrone, | | search, |
| | | nearby) | | graph) |
+------+------+ +-----+------+ +------+------+
| | |
+------v------+ +-----v------+ +------v------+
| Data Sources| | PostGIS / | | Vector DB / |
| (GTFS DB, | | pgRouting | | Knowledge |
| RT cache, | | | | Graph |
| Analytics) | | | | |
+-------------+ +------------+ +-------------+
Key design principles:
- Tool-based architecture: LLM selects and invokes tools; does not access data directly
- Structured retrieval over vector search: Use SQL/API queries for structured transport data; reserve vector search for unstructured content
- Ground truth over generation: Always prefer retrieved data over LLM knowledge for factual answers
- Progressive disclosure: Start with a concise answer; offer to drill deeper
- Map-integrated: Responses can include spatial actions (show on map, highlight, navigate)
- Source attribution: Every data point includes its source and freshness
12. References
- LangChain documentation: https://python.langchain.com/
- LangGraph: https://langchain-ai.github.io/langgraph/
- LlamaIndex: https://docs.llamaindex.ai/
- Vanna.ai (text-to-SQL): https://vanna.ai/
- pgvector: https://github.com/pgvector/pgvector
- Anthropic Claude tool use: https://docs.anthropic.com/claude/docs/tool-use
- OpenAI function calling: https://platform.openai.com/docs/guides/function-calling
- Transmodel ontology: https://transmodel-cen.eu/
- Linked GTFS: http://vocab.gtfs.org/
- Wikidata Query Service: https://query.wikidata.org/
- SQLCoder: https://github.com/defog-ai/sqlcoder
- H3 for spatial indexing: https://h3geo.org/