IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Stanford University, J.P. Morgan AI Research

IKEA Video Manuals is the first dataset to provide 4D grounding of assembly instructions on Internet videos, offering high-quality spatio-temporal alignments between assembly instructions, 3D models, and real-world Internet videos.

Teaser

Abstract

Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present four applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly.

Dataset Overview

Different instruction forms provide different temporal decompositions of the assembly process. Instruction manuals offer a high-level decomposition, while how-to videos demonstrate the more detailed steps of assembling each part. Our dataset aligns each step from the instruction manual with a sequence of substeps in which sub-assemblies are formed (a, b). These substeps are further mapped to segments of the how-to videos, which provide a frame-by-frame demonstration of the assembly. Our dataset also provides spatial details of the whole assembly process in 3D (d), as observed in the instruction manuals and videos, in the form of 6-DoF pose trajectories of the furniture parts. Specifically, for each video frame, the parts being assembled are annotated with 2D image masks, the pose of each part in the camera frame is provided, and the relative poses between parts being assembled are detailed. With the camera intrinsics we additionally provide, the 3D parts can be projected into 2D image space and aligned with their corresponding 2D masks.
Dataset Overview
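
To make the spatial alignment concrete, here is a minimal sketch of how an annotated part could be projected into a video frame, assuming a part mesh, its annotated 6-DoF pose in the camera frame, and the provided camera intrinsics; the variable names and placeholder values are illustrative, not the dataset's actual field names.

```python
# Minimal sketch: project a part's 3D vertices into the image using its
# annotated pose (R, t) in the camera frame and the camera intrinsics K,
# so the projection can be compared against the annotated 2D mask.
import numpy as np

def project_part(vertices, R, t, K):
    """Project Nx3 part vertices (in the part's frame) to 2D pixel coordinates."""
    cam_pts = vertices @ R.T + t        # rigid transform into the camera frame
    uvw = cam_pts @ K.T                 # apply the pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide -> (u, v)

# Placeholder values for illustration only:
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])          # assumed intrinsics
R = np.eye(3)                            # annotated rotation (camera frame)
t = np.array([0.0, 0.0, 1.5])            # annotated translation
vertices = np.random.rand(100, 3) * 0.1  # stand-in for a part mesh
pixels = project_part(vertices, R, t, K) # overlay these on the frame / 2D mask
```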

Furniture Models

The IKEA Video Manuals dataset includes 36 furniture models from 6 categories:

  • Chairs (20 types)
  • Tables (8 types)
  • Benches (3 types)
  • Desks (1 type)
  • Shelves (1 type)
  • Misc (3 types)
All Furniture Models

Annotation

For each video frame, we provide detailed annotations at five levels:

Furniture Level Info

Category: Bench

Name: applaro

Furniture IDs: ['90205182']

Variants: []

Furniture URLs: ['https://www.ikea.com/.../']

Furniture Main Image URL: https://www.ikea....jpg

Video Level Info

Video URL: https://www.youtube.com/...

Other Video URLs for the Same Furniture:

Title: IKEA assembly instructions, APPLARO Bench

Duration: 155

Is_indoor: indoor

Assembly Step Info

Step ID: 1

Step Start: 47.0

Step End: 62.1

Substep ID: 3

Substep Start: 62.04

Substep End: 62.1

Manual Level Info

Manual Step ID: 1

Manual URLs: ['https://www.ikea.com/...=4.pdf']

Manual ID: AA-601524-4

Manual Parts: ['0,1,2', '3']

Manual Connections: [['0,1,2', '3']]

PDF Page: 4

Manual + Masks

Frame Level Info

Frame Time: 52.82

Number of Camera Changes: 1

Frame Parts: ['0,2', '1', '3']

Frame ID: 1584

Is Keyframe: False

Is Frame Before Keyframe: False

Frame Image | Masks | Object Poses
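
As an illustration of how these five levels nest, the following hypothetical Python snippet traverses furniture-, video-, step-, and frame-level records; the file name and key names mirror the fields listed above but are assumptions, not the dataset's actual schema.

```python
# Hypothetical traversal of the five-level annotation hierarchy shown above.
# Manual-level info is assumed to be attached to each assembly step.
import json

with open("ikea_video_manuals_annotations.json") as f:   # hypothetical path
    annotations = json.load(f)

for furniture in annotations:                 # furniture level
    for video in furniture["videos"]:         # video level
        for step in video["steps"]:           # assembly step level (+ manual info)
            for frame in step["frames"]:      # frame level
                if frame["is_keyframe"]:
                    print(furniture["name"], step["step_id"],
                          frame["frame_time"], frame["frame_parts"])
```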

Environment (98 Videos)

NEW !

Data Collection and Annotation

The IKEA Video Manuals dataset is built on top of the IKEA-Manual dataset and the IAW dataset. It collects 36 segmented 3D furniture models from the IKEA-Manual dataset and 98 associated assembly videos from the IAW dataset.

Annotation Pipeline
Fig. 4: Annotation pipeline for the IKEA Video Manuals dataset.

Pose Refinement

NEW !

Our pose refinement process improves the accuracy of 3D part poses in assembly videos. This process is crucial for ensuring physically valid assembly sequences and accurate 3D reconstructions.

  1. Initial Estimation: We start with the Perspective-n-Point (PnP) algorithm to estimate initial poses (see the sketch after this list). While this provides a good 2D overlay, it often results in inaccurate 3D poses.
  2. Issue Identification: Viewing the scene from different angles, particularly side views, reveals incorrect spatial relationships between parts that are not apparent from the camera's original viewpoint.
  3. Refinement Process: We've developed an interactive interface that allows annotators to:
    • Control the virtual camera using axis-aligned controls
    • View the 3D scene from different orthographic perspectives
    • Refine part poses by rotating and translating them in 3D space
    • Compare the real-time 3D view with corresponding video frames
  4. Relative Pose Accuracy: To improve the accuracy of relative poses, parts that appear together in a video frame are annotated simultaneously, with a visualization of their 3D locations.
  5. Temporal Smoothness: We initialize part poses with poses from the previous frame to improve the temporal smoothness of part trajectories.
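
As a rough illustration of steps 1 and 5 above, the sketch below seeds a pose with OpenCV's PnP solver and warm-starts the next frame from the previous estimate; it is not the annotation interface itself, and the correspondences and intrinsics shown are placeholders.

```python
# Sketch: PnP initialization plus warm-starting from the previous frame's pose.
import cv2
import numpy as np

def estimate_pose(pts_3d, pts_2d, K, prev_rvec=None, prev_tvec=None):
    """Estimate a part pose from 2D-3D correspondences, optionally warm-started."""
    dist = np.zeros(5)                      # assume no lens distortion
    use_guess = prev_rvec is not None
    ok, rvec, tvec = cv2.solvePnP(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64), K, dist,
        rvec=prev_rvec, tvec=prev_tvec,
        useExtrinsicGuess=use_guess, flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec                       # axis-angle rotation, translation (camera frame)

# Placeholder correspondences and intrinsics:
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
pts_3d = np.random.rand(8, 3)                       # part-frame keypoints
pts_2d = np.random.rand(8, 2) * [1280, 720]         # clicked image points
rvec, tvec = estimate_pose(pts_3d, pts_2d, K)       # frame t
rvec2, tvec2 = estimate_pose(pts_3d, pts_2d, K, rvec, tvec)  # frame t+1, warm start
```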
Pose Refinement Process

This refinement process is essential for constructing physically valid assembly sequences and ensuring the final 3D model accurately matches the real-world object. It addresses key challenges in 3D pose estimation from 2D internet videos, particularly in complex assembly scenarios with multiple interacting parts.

More Examples

Pose Refinement

This animation shows the assembly process in 4D from the front and side views before and after pose refinement. The relative poses between objects are significantly improved after refinement, leading to a more accurate 3D reconstruction of the final assembly. This is particularly important for complex furniture items with multiple interacting parts, where small errors can compound into significant misalignments in the final assembly.

Applications

The IKEA Video Manuals dataset showcases its versatility and practical relevance through five tasks fundamental to furniture assembly analysis:

  1. Assembly Plan Generation

    This task focuses on predicting a hierarchical assembly plan by analyzing a sequence of video frames that depict the furniture assembly process. The dataset offers physically realistic assembly plans extracted from Internet videos, providing a more detailed and diverse set of assembly steps compared to plans derived solely from instruction manuals.

    Assembly Plan Generation
  2. Part-Conditioned Segmentation

    The objective of part-conditioned segmentation is to generate pixel-wise segmentation masks for furniture sub-assemblies within the assembly process. The diverse videos in the dataset enable the evaluation of part segmentation methods in real-world scenarios. The results emphasize the challenges posed by occlusions, complex backgrounds, and textureless 3D shapes when detecting object parts in Internet videos.

    Part-Conditioned Segmentation
  3. Part-Conditioned Pose Estimation

    Part-conditioned pose estimation aims to predict the 6D pose of a furniture sub-assembly in a video frame. Accurate estimation of 3D poses of furniture parts from each video frame is crucial for developing a grounded understanding of the assembly process.

    Part-Conditioned Pose Estimation
  4. Mask Tracking NEW !

    We evaluated two state-of-the-art mask tracking models, SAM2 and Cutie, on our dataset. Our analysis reveals significant challenges that these models face in real-world assembly scenarios.

    Dataset               SAM2 (Hiera-L) J&F   Cutie-base J&F
    Our Dataset           0.736                0.547
    MOSE                  0.772                0.699
    DAVIS 2017 val        0.916                0.879
    SA-V val              0.756                0.607
    YouTubeVOS 2019 val   0.891                0.870

    Error Analysis:

    • Camera Changes: Both models struggle with abrupt camera movements, often losing track of the object of interest.
    • Similar Appearances: In assembly scenarios with multiple similar parts, the models often confuse different parts, leading to tracking errors.
    Ground Truth Camera Changes
    Ground Truth: Camera Changes
    Cutie Camera Changes
    Cutie: Camera Changes. Cutie successfully tracks some portion of the furniture piece after camera changes, but loses track of the rest.
    SAM2 Camera Changes
    SAM2: Camera Changes. The model loses track after an abrupt camera movement.
    Ground Truth Similar Appearance
    Ground Truth: Similar Appearance
    Cutie Similar Appearance
    Cutie: Similar Appearance. The model fails to track the pad at the top of the chair.
    SAM2 Similar Appearance
    SAM2: Similar Appearance. SAM2 fails to track some parts of the chair.
  5. Shape Assembly with Instruction Videos

    Given a set of 3D parts and an instruction video, shape assembly with instruction videos aims to determine the 6D poses of the 3D parts so that they assemble into a complete piece of furniture. The proposed modular video-based shape assembly pipeline incorporates keyframe detection, assembled part recognition, pose estimation, and iterative assembly; a schematic skeleton follows below.
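
For intuition, here is a schematic skeleton of such a modular pipeline with stand-in components; the function bodies are placeholders that only illustrate the data flow from keyframe detection through assembled-part recognition and pose estimation to iterative assembly, not the paper's actual models.

```python
# Schematic pipeline skeleton with placeholder components.
from dataclasses import dataclass, field

@dataclass
class AssemblyState:
    """Poses accumulated so far: part id -> (rotation, translation)."""
    part_poses: dict = field(default_factory=dict)

def detect_keyframes(frames):
    """Placeholder: return frames where a part attachment is completed."""
    return frames[::30]                      # stub: sample every 30th frame

def recognize_assembled_parts(keyframe, parts):
    """Placeholder: return ids of the parts attached in this keyframe."""
    return [parts[0]["id"]]                  # stub

def estimate_part_pose(keyframe, part_id):
    """Placeholder: return a 6D pose (rotation, translation) for one part."""
    return ("R_identity", (0.0, 0.0, 0.0))   # stub

def assemble(frames, parts):
    state = AssemblyState()
    for kf in detect_keyframes(frames):
        for pid in recognize_assembled_parts(kf, parts):
            state.part_poses[pid] = estimate_part_pose(kf, pid)  # iterative assembly
    return state

# Usage with dummy inputs:
result = assemble(frames=list(range(300)), parts=[{"id": "0"}, {"id": "1"}])
print(result.part_poses)
```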

Challenges in Real-World Assembly Videos

NEW !
Camera Changes
Camera Changes: Camera changes are common in real-world assembly videos, introducing challenges for tracking and camera calibration.
Heavy Occlusion
Heavy Occlusion: Occlusions are prevalent in assembly scenarios, especially when multiple parts are being assembled simultaneously, or when the camera is positioned close to the assembly area.
Construction and Deconstruction
Construction and Deconstruction: Unlike controlled environments, real-world assembly can be ambiguous between construction and deconstruction. In the example shown, the user first assembles a part and then disassembles it at 8s.
Diverse Environment
Diverse Environment: Our dataset includes diverse assembly scenarios, such as multi-person assembly and outdoor assembly, which introduce additional challenges for tracking and pose estimation.