IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Stanford University, J.P. Morgan AI Research

IKEA Video Manuals is the first dataset to provide 4D grounding of assembly instructions on Internet videos, offering high-quality spatio-temporal alignments between assembly instructions, 3D models, and real-world Internet videos.

Teaser

Abstract

Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present four applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly.

Dataset Overview

Different instruction forms provide different temporal decompositions of the assembly process. Instruction manuals offer a high-level decomposition, while how-to videos demonstrate the more detailed steps of assembling each part. Our dataset aligns each step from the instruction manual with a sequence of substeps in which sub-assemblies are formed (a, b). These substeps are further mapped to segments of the how-to videos, which provide a frame-by-frame demonstration of the assembly. Our dataset also provides spatial details of the whole assembly process in 3D (d), as observed in the instruction manuals and videos, in the form of 6-DoF pose trajectories of the furniture parts. Specifically, for each video frame, the parts being assembled are annotated with 2D image masks, the pose of each part in the camera frame is provided, and the relative poses between parts being assembled are detailed. With the camera intrinsics we additionally provide, the 3D parts can be projected into 2D image space and aligned with their corresponding 2D masks.
Dataset Overview
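
To make the spatial alignment concrete, here is a minimal sketch of how an annotated part could be projected into a video frame, assuming a part mesh, its annotated 6-DoF pose in the camera frame, and the provided camera intrinsics; the variable names and placeholder values are illustrative, not the dataset's actual field names.

```python
# Minimal sketch: project a part's 3D vertices into the image using its
# annotated pose (R, t) in the camera frame and the camera intrinsics K,
# so the projection can be compared against the annotated 2D mask.
import numpy as np

def project_part(vertices, R, t, K):
    """Project Nx3 part vertices (in the part's frame) to 2D pixel coordinates."""
    cam_pts = vertices @ R.T + t        # rigid transform into the camera frame
    uvw = cam_pts @ K.T                 # apply the pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide -> (u, v)

# Placeholder values for illustration only:
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])          # assumed intrinsics
R = np.eye(3)                            # annotated rotation (camera frame)
t = np.array([0.0, 0.0, 1.5])            # annotated translation
vertices = np.random.rand(100, 3) * 0.1  # stand-in for a part mesh
pixels = project_part(vertices, R, t, K) # overlay these on the frame / 2D mask
```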

Furniture Models

The IKEA Video Manuals dataset includes 36 furniture models from 6 categories:

  • Chairs (20 types)
  • Tables (8 types)
  • Benches (3 types)
  • Desks (1 type)
  • Shelves (1 type)
  • Misc (3 types)
All Furniture Models

Annotation

For each video frame, we provide detailed annotations at five levels:

Furniture Level Info

Category: Bench

Name: applaro

Furniture IDs: ['90205182']

Variants: []

Furniture URLs: ['https://www.ikea.com/.../']

Furniture Main Image URL: https://www.ikea....jpg

Video Level Info

Video URL: https://www.youtube.com/...

Other Video URLs for the Same Furniture:

Title: IKEA assembly instructions, APPLARO Bench

Duration: 155

Is_indoor: indoor

Assembly Step Info

Step ID: 1

Step Start: 47.0

Step End: 62.1

Substep ID: 3

Substep Start: 62.04

Substep End: 62.1

Manual Level Info

Manual Step ID: 1

Manual URLs: ['https://www.ikea.com/...=4.pdf']

Manual ID: AA-601524-4

Manual Parts: ['0,1,2', '3']

Manual Connections: [['0,1,2', '3']]

PDF Page: 4

Manual + Masks

Frame Level Info

Frame Time: 52.82

Number of Camera Changes: 1

Frame Parts: ['0,2', '1', '3']

Frame ID: 1584

Is Keyframe: False

Is Frame Before Keyframe: False

Frame Image | Masks | Object Poses
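
As an illustration of how these five levels nest, the following hypothetical Python snippet traverses furniture-, video-, step-, and frame-level records; the file name and key names mirror the fields listed above but are assumptions, not the dataset's actual schema.

```python
# Hypothetical traversal of the five-level annotation hierarchy shown above.
# Manual-level info is assumed to be attached to each assembly step.
import json

with open("ikea_video_manuals_annotations.json") as f:   # hypothetical path
    annotations = json.load(f)

for furniture in annotations:                 # furniture level
    for video in furniture["videos"]:         # video level
        for step in video["steps"]:           # assembly step level (+ manual info)
            for frame in step["frames"]:      # frame level
                if frame["is_keyframe"]:
                    print(furniture["name"], step["step_id"],
                          frame["frame_time"], frame["frame_parts"])
```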

Environment (98 Videos)

NEW !

Data Collection and Annotation

The IKEA Video Manuals dataset is built on top of the IKEA-Manual dataset and the IAW dataset. It collects 36 segmented 3D furniture models from the IKEA-Manual dataset and 98 associated assembly videos from the IAW dataset.

Annotation Pipeline
Fig. 4: Annotation pipeline for the IKEA Video Manuals dataset.

Pose Refinement

NEW !

Our pose refinement process improves the accuracy of 3D part poses in assembly videos. This process is crucial for ensuring physically valid assembly sequences and accurate 3D reconstructions.

  1. Initial Estimation: We start with the Perspective-n-Point (PnP) algorithm to estimate initial poses (see the sketch after this list). While this provides a good 2D overlay, it often results in inaccurate 3D poses.
  2. Issue Identification: Viewing the scene from different angles, particularly side views, reveals incorrect spatial relationships between parts that are not apparent from the camera's original viewpoint.
  3. Refinement Process: We've developed an interactive interface that allows annotators to:
    • Control the virtual camera using axis-aligned controls
    • View the 3D scene from different orthographic perspectives
    • Refine part poses by rotating and translating them in 3D space
    • Compare the real-time 3D view with corresponding video frames
  4. Relative Pose Accuracy: To improve the accuracy of relative poses, parts that appear together in a video frame are annotated simultaneously, with a visualization of their 3D locations.
  5. Temporal Smoothness: We initialize part poses with poses from the previous frame to improve the temporal smoothness of part trajectories.
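
As a rough illustration of steps 1 and 5 above, the sketch below seeds a pose with OpenCV's PnP solver and warm-starts the next frame from the previous estimate; it is not the annotation interface itself, and the correspondences and intrinsics shown are placeholders.

```python
# Sketch: PnP initialization plus warm-starting from the previous frame's pose.
import cv2
import numpy as np

def estimate_pose(pts_3d, pts_2d, K, prev_rvec=None, prev_tvec=None):
    """Estimate a part pose from 2D-3D correspondences, optionally warm-started."""
    dist = np.zeros(5)                      # assume no lens distortion
    use_guess = prev_rvec is not None
    ok, rvec, tvec = cv2.solvePnP(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64), K, dist,
        rvec=prev_rvec, tvec=prev_tvec,
        useExtrinsicGuess=use_guess, flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec                       # axis-angle rotation, translation (camera frame)

# Placeholder correspondences and intrinsics:
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
pts_3d = np.random.rand(8, 3)                       # part-frame keypoints
pts_2d = np.random.rand(8, 2) * [1280, 720]         # clicked image points
rvec, tvec = estimate_pose(pts_3d, pts_2d, K)       # frame t
rvec2, tvec2 = estimate_pose(pts_3d, pts_2d, K, rvec, tvec)  # frame t+1, warm start
```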
Pose Refinement Process

This refinement process is essential for constructing physically valid assembly sequences and ensuring the final 3D model accurately matches the real-world object. It addresses key challenges in 3D pose estimation from 2D internet videos, particularly in complex assembly scenarios with multiple interacting parts.

More Examples

Pose Refinement

This animation shows the assembly process in 4D from the front and side views before and after pose refinement. The relative poses between objects are significantly improved after refinement, leading to a more accurate 3D reconstruction of the final assembly. This is particularly important for complex furniture items with multiple interacting parts, where small errors can compound into significant misalignments in the final assembly.

Applications

The IKEA Video Manuals dataset showcases its versatility and practical relevance through five tasks fundamental to furniture assembly analysis:

  1. Assembly Plan Generation

    This task focuses on predicting a hierarchical assembly plan by analyzing a sequence of video frames that depict the furniture assembly process. The dataset offers physically realistic assembly plans extracted from Internet videos, providing a more detailed and diverse set of assembly steps compared to plans derived solely from instruction manuals.

    Assembly Plan Generation
  2. Part-Conditioned Segmentation

    The objective of part-conditioned segmentation is to generate pixel-wise segmentation masks for furniture sub-assemblies within the assembly process. The diverse videos in the dataset enable the evaluation of part segmentation methods in real-world scenarios. The results emphasize the challenges posed by occlusions, complex backgrounds, and textureless 3D shapes when detecting object parts in Internet videos.

    Part-Conditioned Segmentation
  3. Part-Conditioned Pose Estimation

    Part-conditioned pose estimation aims to predict the 6D pose of a furniture sub-assembly in a video frame. Accurate estimation of 3D poses of furniture parts from each video frame is crucial for developing a grounded understanding of the assembly process.

    Part-Conditioned Pose Estimation
  4. Mask Tracking NEW !

    We evaluated two state-of-the-art mask tracking models, SAM2 and Cutie, on our dataset. Our analysis reveals significant challenges that these models face in real-world assembly scenarios.

    Dataset               SAM2 (Hiera-L) J&F   Cutie-base J&F
    Our Dataset           0.736                0.547
    MOSE                  0.772                0.699
    DAVIS 2017 val        0.916                0.879
    SA-V val              0.756                0.607
    YouTubeVOS 2019 val   0.891                0.870

    Error Analysis:

    • Camera Changes: Both models struggle with abrupt camera movements, often losing track of the object of interest.
    • Similar Appearances: In assembly scenarios with multiple similar parts, the models often confuse different parts, leading to tracking errors.
    Ground Truth Camera Changes
    Ground Truth: Camera Changes
    Cutie Camera Changes
    Cutie: Camera Changes. Cutie successfully tracks some portion of the furniture piece after camera changes, but loses track of the rest.
    SAM2 Camera Changes
    SAM2: Camera Changes. The model loses track after an abrupt camera movement.
    Ground Truth Similar Appearance
    Ground Truth: Similar Appearance
    Cutie Similar Appearance
    Cutie: Similar Appearance. The model fails to track the pad at the top of the chair.
    SAM2 Similar Appearance
    SAM2: Similar Appearance. SAM2 fails to track some parts of the chair.
  5. Shape Assembly with Instruction Videos

    Given a set of 3D parts and an instruction video, shape assembly with instruction videos aims to determine the 6D poses of the 3D parts so that they assemble into a complete piece of furniture. The proposed modular video-based shape assembly pipeline incorporates keyframe detection, assembled part recognition, pose estimation, and iterative assembly; a schematic skeleton follows below.
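
For intuition, here is a schematic skeleton of such a modular pipeline with stand-in components; the function bodies are placeholders that only illustrate the data flow from keyframe detection through assembled-part recognition and pose estimation to iterative assembly, not the paper's actual models.

```python
# Schematic pipeline skeleton with placeholder components.
from dataclasses import dataclass, field

@dataclass
class AssemblyState:
    """Poses accumulated so far: part id -> (rotation, translation)."""
    part_poses: dict = field(default_factory=dict)

def detect_keyframes(frames):
    """Placeholder: return frames where a part attachment is completed."""
    return frames[::30]                      # stub: sample every 30th frame

def recognize_assembled_parts(keyframe, parts):
    """Placeholder: return ids of the parts attached in this keyframe."""
    return [parts[0]["id"]]                  # stub

def estimate_part_pose(keyframe, part_id):
    """Placeholder: return a 6D pose (rotation, translation) for one part."""
    return ("R_identity", (0.0, 0.0, 0.0))   # stub

def assemble(frames, parts):
    state = AssemblyState()
    for kf in detect_keyframes(frames):
        for pid in recognize_assembled_parts(kf, parts):
            state.part_poses[pid] = estimate_part_pose(kf, pid)  # iterative assembly
    return state

# Usage with dummy inputs:
result = assemble(frames=list(range(300)), parts=[{"id": "0"}, {"id": "1"}])
print(result.part_poses)
```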

Challenges in Real-World Assembly Videos

NEW !
Camera Changes
Camera Changes: Camera changes are common in real-world assembly videos, introducing challenges for tracking and camera calibration.
Heavy Occlusion
Heavy Occlusion: Occlusions are prevalent in assembly scenarios, especially when multiple parts are being assembled simultaneously, or when the camera is positioned close to the assembly area.
Construction and Deconstruction
Construction and Deconstruction: Unlike controlled environments, real-world assembly can be ambiguous between construction and deconstruction. In the example shown, the user first assembles a part and then disassembles it at 8s.
Diverse Environment
Diverse Environment: Our dataset includes diverse assembly scenarios, such as multi-person assembly and outdoor assembly, which introduce additional challenges for tracking and pose estimation.