Real-Time Scene Graph Generation in ROS 2

ROS 2 · OpenCV · YOLO11 · Open3D

In Spring 2025, I contributed to the Assistive Robotics Scene Understanding team under Georgia Tech's VIP research program, focusing on developing a modular scene graph generation pipeline in ROS 2. The goal was to enable semantic scene understanding by extracting spatial relationships between detected objects using synchronized RGB-D data.

System Architecture Summary

The system is composed of multiple ROS 2 nodes: a camera input publisher, an object detector, a surface detector, and the scene graph builder. The flow of data between these nodes is illustrated in the diagram below.
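
As a rough illustration of that layout, a minimal ROS 2 Python launch file might look like the following sketch; the package and executable names are placeholders I've assumed, not the project's actual ones.

```python
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    # Hypothetical package/executable names, one Node per pipeline stage.
    return LaunchDescription([
        Node(package='scene_graph', executable='camera_publisher'),
        Node(package='scene_graph', executable='yolo_segmentation'),
        Node(package='scene_graph', executable='surface_detection'),
        Node(package='scene_graph', executable='scene_graph_builder'),
    ])
```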

Object Detection and Depth Integration

The system leverages YOLO11l-seg, an instance segmentation model from the Ultralytics YOLO11 family, integrated into a ROS 2 node to detect and segment objects from the live RGB camera stream. The pipeline incorporates synchronized depth data from an iPhone's LiDAR sensor, exported using the Record3D app and ingested into ROS 2 via a custom node. RGB and depth frames are synchronized with message_filters' ApproximateTimeSynchronizer so that spatial inference can be performed on each frame pair.
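
A minimal sketch of that synchronization step, assuming conventional image topic names (the project's actual topics may differ):

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from message_filters import Subscriber, ApproximateTimeSynchronizer


class RgbdSync(Node):
    def __init__(self):
        super().__init__('rgbd_sync')
        rgb = Subscriber(self, Image, '/camera/color/image_raw')    # assumed topic
        depth = Subscriber(self, Image, '/camera/depth/image_raw')  # assumed topic
        # Pair frames whose timestamps differ by at most 50 ms.
        self.sync = ApproximateTimeSynchronizer([rgb, depth], queue_size=10, slop=0.05)
        self.sync.registerCallback(self.on_pair)

    def on_pair(self, rgb_msg: Image, depth_msg: Image):
        # Matched RGB-D pair: run segmentation on rgb_msg, look up depth in depth_msg.
        self.get_logger().debug('synchronized frame pair received')


def main():
    rclpy.init()
    rclpy.spin(RgbdSync())
```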

For each segmented object, the scene graph builder extracts:

  • Bounding box centroid (x, y)

  • Median depth from a 5×5 window around the centroid

  • Class label and instance ID

The system publishes enriched detection results via a custom ROS 2 message (YoloSegmentationMsg) that includes pixel coordinates and metric depth.
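
The median-depth lookup might look like the following sketch; the depth image is assumed to arrive as a metric NumPy array, and the boundary/validity handling is illustrative rather than the node's exact code.

```python
import numpy as np


def median_depth(depth: np.ndarray, cx: int, cy: int, half: int = 2) -> float:
    """Median of the 5x5 depth window centered on the bounding-box centroid."""
    h, w = depth.shape
    window = depth[max(0, cy - half):min(h, cy + half + 1),
                   max(0, cx - half):min(w, cx + half + 1)]
    valid = window[np.isfinite(window) & (window > 0)]  # drop invalid/zero returns
    return float(np.median(valid)) if valid.size else float('nan')
```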

Scene Graph Construction

The scene graph builder node subscribes to this segmentation topic and dynamically generates object-object spatial relationships through pairwise comparisons. The key relationships modeled include (see the sketch after this list):

  • Left of / Right of: Based on x-coordinate difference

  • In front of / Behind: Based on depth difference, with a configurable tolerance

  • Above / Below: Initially based on y-coordinates, later refined using surface-aware logic

  • Is on: Derived from spatial overlap and vertical offset thresholds
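
A simplified sketch of these pairwise comparisons; the horizontal pixel tolerance is an assumed value, the ±3 cm depth tolerance comes from the Stability section below, and above/below is omitted here because it depends on the surface-aware logic described next.

```python
from itertools import combinations

X_TOL_PX = 20       # horizontal pixel tolerance (assumed value)
DEPTH_TOL_M = 0.03  # +/- 3 cm depth tolerance, per the Stability section


def pairwise_relations(objects):
    """objects: list of dicts with 'label', 'cx' (px), and 'depth' (m) keys."""
    edges = []
    for a, b in combinations(objects, 2):
        if a['cx'] + X_TOL_PX < b['cx']:
            edges.append((a['label'], 'left of', b['label']))
        elif b['cx'] + X_TOL_PX < a['cx']:
            edges.append((a['label'], 'right of', b['label']))
        if a['depth'] + DEPTH_TOL_M < b['depth']:
            edges.append((a['label'], 'in front of', b['label']))
        elif b['depth'] + DEPTH_TOL_M < a['depth']:
            edges.append((a['label'], 'behind', b['label']))
    return edges
```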

To resolve inconsistencies introduced by perspective distortion, I implemented a reference surface detection node using the Open3D library. This node segments large horizontal planes (e.g., floor, table) from the point cloud and publishes them on a /detected_surfaces topic. If two objects share a surface, their vertical relationship is exempted from the raw y-pixel comparison.
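
A plausible sketch of that surface detection using Open3D's RANSAC-based segment_plane; the thresholds and the gravity-alignment test are assumptions, not the node's actual parameters.

```python
import numpy as np
import open3d as o3d


def detect_horizontal_planes(pcd: o3d.geometry.PointCloud, max_planes: int = 3):
    """Iteratively fit planes with RANSAC, keeping near-horizontal ones."""
    planes, rest = [], pcd
    for _ in range(max_planes):
        if len(rest.points) < 500:  # stop once too few points remain (assumed cutoff)
            break
        model, inliers = rest.segment_plane(distance_threshold=0.02,
                                            ransac_n=3,
                                            num_iterations=1000)
        a, b, c, _ = model
        normal = np.array([a, b, c]) / np.linalg.norm([a, b, c])
        # Assumes the camera y-axis is roughly aligned with gravity.
        if abs(normal[1]) > 0.9:
            planes.append((model, rest.select_by_index(inliers)))
        rest = rest.select_by_index(inliers, invert=True)
    return planes
```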

Stability and Optimization

To reduce output volatility from noisy detections or minor object jitter:

  • Scene graph updates are gated by semantic deltas (change in relationship topology)

  • Depth values are filtered using exponential moving average (EMA) smoothing (α = 0.4); see the sketch after this list

  • Spatial tolerances (e.g., ±3 cm for depth) are used to suppress negligible differences
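
A minimal sketch of the EMA depth filter referenced in the list above, with the stated α = 0.4; the class structure is illustrative.

```python
class EmaFilter:
    """Exponential moving average smoother for per-object depth readings."""

    def __init__(self, alpha: float = 0.4):
        self.alpha = alpha
        self.value = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x  # seed with the first reading
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value
```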

Multiple relationships between the same object pair are consolidated into unified statements for readability (e.g., “cup is in front and to the left of bottle”). An “is_on” relationship is appended when lateral overlap is high and vertical offset falls within a 5–15 cm window.
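
The consolidation step could be sketched as follows; the edge-tuple format and relation phrasing are assumptions for illustration.

```python
from collections import defaultdict


def consolidate(edges):
    """Merge multiple relations for an object pair into one statement, e.g.
    [('cup', 'in front of', 'bottle'), ('cup', 'left of', 'bottle')]
    -> 'cup is in front of and left of bottle'."""
    grouped = defaultdict(list)
    for subj, rel, obj in edges:
        grouped[(subj, obj)].append(rel)
    return [f'{s} is {" and ".join(rels)} {o}' for (s, o), rels in grouped.items()]
```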

The complete implementation was tested in both Gazebo and real-world environments using ROS 2 Jazzy, with RViz visualizations and live terminal logs for validation. Code and demos are available here.

This pipeline lays the foundation for goal-conditioned reasoning and symbolic planning in assistive robotics applications. Future extensions will focus on improved temporal consistency, object permanence under occlusion, and integration with task planners.

Published April 2025
