5.6 Folding experiments
In this section, the proposed approach is validated on a real-world experiment involving manipulation of deformable objects, namely folding of a T-shirt. As opposed to the box stacking task, the true underlying states are in this case unknown, and it is therefore not possible to define an automatic verification of the correctness of a given visual action plan. The folding task setup, depicted in Figure 5.11, is composed of a Rethink Robotics Baxter robot equipped with a Primesense RGB-D camera mounted on its torso.
The execution videos of all the performed experiments and respective visual action plans can be found on the project website2. For this task, a dataset TI containing 1283 training tuples was collected. Each tuple consists of two images of size 256 × 256 × 3 and action-specific information u = (p, r, h), where p = (pr, pc) are the picking coordinates, r = (rr, rc) the releasing coordinates, and h the picking height. An example of an action and a no-action pair is shown in Figure 5.4. The values pr, pc, rr, rc ∈ {0, . . . , 255} correspond to image coordinates, while h ∈ {0, 1} indicates either the height of the table or a value measured by the RGB-D camera so as to pick up only the top layer of the shirt. Note that the latter is a challenging task in its own right [175] and leads to decreased performance when it must be performed, as shown in the following. The dataset TI was collected by a human operator manually selecting pick and release points on images showing a given T-shirt configuration, and recording the corresponding action and resulting configuration. No-action pairs, representing ≈ 37% of the training tuples in TI, were generated by slightly perturbing the cloth appearance.
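For illustration, one element of the dataset TI can be represented as follows. This is a minimal sketch, not the thesis implementation: the class and field names are hypothetical, chosen to mirror the tuple structure (image pair, pick and release coordinates, height flag) described above.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class TrainingTuple:
    """One element of the dataset T_I (illustrative field names).

    h is the picking-height flag: 0 = table height,
    1 = top-layer height measured by the RGB-D camera.
    """
    obs_before: np.ndarray    # 256 x 256 x 3 RGB image
    obs_after: np.ndarray     # 256 x 256 x 3 RGB image
    pick: Tuple[int, int]     # (p_r, p_c), image coordinates in {0, ..., 255}
    release: Tuple[int, int]  # (r_r, r_c), image coordinates in {0, ..., 255}
    h: int                    # picking height flag in {0, 1}
    is_action: bool           # False for a no-action pair


t = TrainingTuple(
    obs_before=np.zeros((256, 256, 3), dtype=np.uint8),
    obs_after=np.zeros((256, 256, 3), dtype=np.uint8),
    pick=(120, 64), release=(120, 192), h=0, is_action=True,
)
assert all(0 <= c <= 255 for c in t.pick + t.release) and t.h in (0, 1)
```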
Finally, a re-planning step after each action completion is introduced as shown in Figure 5.12. This accounts for potential execution uncertainties, such as inaccuracies in the grasping or positioning phases of pick-and-place operations, which lead to observations different from the ones planned in PI. Note that after each action execution, the current observation of the cloth is considered as a new start observation, and a new visual action plan
2https://visual-action-planning.github.io/lsr-v2/
Figure 5.11: Experimental setup composed of a Baxter Robot and an RGB-D camera.
is produced until the goal observation is reached or the task is terminated. This re-planning setup is used for all folding experiments. As the goal configuration does not indicate how the sleeves should be folded, the LSR suggests multiple latent plans. A subset of the corresponding visual action plans is shown on the left of Figure 5.12. If multiple plans are generated, a human operator selects one to execute. After the first execution, the ambiguity arising from the sleeve folding is removed. The re-planning therefore generates a single plan, shown on the right, that leads from the start to the goal state.
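The re-planning loop described above can be sketched as follows. All callables (`plan`, `execute_action`, `observe`, `reached_goal`, `select_plan`) are hypothetical stand-ins for the MM/LSR/APN components and the human operator; the sketch only captures the control flow of re-planning after each executed action.

```python
def replan_and_execute(start_obs, goal_obs, plan, execute_action, observe,
                       reached_goal, select_plan, max_steps=20):
    """Re-planning loop: after each executed action, the new observation
    becomes the start of a fresh planning problem.

    All callables are illustrative stand-ins, not the thesis API.
    Returns True if the goal is reached within max_steps actions.
    """
    current = start_obs
    for _ in range(max_steps):
        if reached_goal(current, goal_obs):
            return True
        candidate_plans = plan(current, goal_obs)  # LSR may return several plans
        if not candidate_plans:
            return False                           # task terminated: no plan found
        chosen = select_plan(candidate_plans)      # e.g. human operator picks one
        execute_action(chosen[0])                  # execute only the first action
        current = observe()                        # new start observation
    return False
```

With toy integer "observations" and a plan that always proposes a single +1 step, the loop reaches the goal and stops.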
Figure 5.12: Execution of the folding task with re-planning. On the left, a set of initial visual action plans reaching the goal state is proposed. After the first execution, only one viable visual action plan remains.
Implementation details
Similarly to Section 5.5, the notation VAEld-f-d is used to denote a VAE with an ld-dimensional latent space, where f stands for the folding task and d indicates whether or not the model was trained with the action loss (5.4). In particular, d = b denotes baseline VAEs trained with the original training objective (5.5), while d = Lp denotes action VAEs trained with the objective (5.6) containing the action term (5.4) using the metric Lp
for p ∈ {1, 2, ∞}. The VAEs use the same ResNet architecture and the same hyperparameters β, γ and dm as in the box stacking task introduced in Section 5.5, but the latent space dimension is increased to ld = 16. For the LSR, the same notation as in Section 5.5.2 is adopted, where LSR-Lp denotes a graph obtained by using the metric Lp in Algorithm 2. The upper bound cmax on the maximum number of graph-connected components in (5.8) is set to 5, and the search interval boundaries τmin and τmax in Algorithm 3 are set to 0 and 3.5, respectively. The performance of the APMs and the evaluation of the system are based on the VAE16-f-L1 realization of the MM. The experiments are thus performed using APN16-f-L1, which is trained on latent action pairs Tz extracted by the latent mapping ξ of VAE16-f-L1. Five models are trained for 500 epochs using different random seeds, as in the case of the VAEs.
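A search over the interval [τmin, τmax] subject to a bound on the number of graph-connected components might look as follows. This is a simplified sketch under stated assumptions, not the thesis's Algorithm 3: it links latent points closer than τ under the L1 metric, counts components with union-find, and bisects for a τ that keeps the component count at most cmax.

```python
import numpy as np


def count_components(z, tau):
    """Connected components of the graph that links latent points
    whose L1 distance is below tau (simple union-find)."""
    n = len(z)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.abs(z[i] - z[j]).sum() < tau:  # L1 distance
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


def search_tau(z, c_max=5, tau_min=0.0, tau_max=3.5, iters=20):
    """Bisect for a small tau whose graph has at most c_max components.

    A simplified stand-in for the actual search; assumes tau_max
    itself satisfies the component bound.
    """
    lo, hi = tau_min, tau_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_components(z, mid) <= c_max:
            hi = mid  # bound satisfied: try a smaller tau
        else:
            lo = mid  # too many components: increase tau
    return hi
```

On two well-separated pairs of 1-D latent points, a tiny τ yields four components, while the returned τ merges each pair.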
Finally, 15% of the training dataset is used as a validation split to select the best performing model for the evaluation.
Covered regions using LSR
As in the box stacking task, the covered regions identified by the LSR are analyzed. To this aim, the model VAE16-f-L1 is used and the following inputs are considered: 224 novel observations, which correspond to possible states of the system and are not used during training, and 5000 images from each of the 3D Shapes and CIFAR-10 datasets, which represent out-of-distribution samples that do not resemble the training data. The LSR achieves good recognition performance even in the folding task. More specifically, on average 213/224 samples representing the true underlying
states of the system are correctly recognized as covered, resulting in 95 ± 2.4% accuracy averaged over the 5 different random seeds.
For the 3D Shapes dataset, 0/5000 samples are recognized as covered, while on average only 20/5000 samples from the CIFAR-10 dataset are wrongly recognized as covered. This analysis thus confirms the effectiveness of the LSR in capturing the covered regions of the latent space. It also shows that learning the latent mapping on real-world observations, representing states of deformable objects, is more complex than on simulated observations.
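A covered-region test of this kind can be sketched as a nearest-neighbour check in latent space: a sample counts as covered if its encoding lies within some ε of a covered-region representative. This is an illustrative simplification, assuming the L1 metric used throughout this section; the function names and ε threshold are hypothetical.

```python
import numpy as np


def is_covered(z_sample, region_centers, epsilon):
    """Recognize a latent encoding as covered if it lies within
    epsilon (L1 metric) of some covered-region representative.
    A simplified stand-in for the actual covered-region test."""
    dists = np.abs(region_centers - z_sample).sum(axis=1)
    return bool(dists.min() <= epsilon)


def coverage_accuracy(samples, region_centers, epsilon, expect_covered):
    """Fraction of samples whose covered/not-covered decision
    matches the expected label."""
    hits = sum(is_covered(s, region_centers, epsilon) == expect_covered
               for s in samples)
    return hits / len(samples)
```

In-distribution encodings near a region center are accepted, while out-of-distribution encodings far from every center are rejected.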
System performance
The proposed method is here experimentally validated and compared with the preliminary framework [153] on which it builds; it is additionally employed on a more challenging fold that involves picking a layer of the cloth on top of another layer. In particular, the following quantities are evaluated: (i) the system success rate, i.e., a folding is considered successful if the system is able to actually fold the T-shirt into the desired goal configuration; (ii) the percentage of successful transitions of the system, i.e., a transition is considered successful if the respective folding step is executed correctly; (iii-iv) the quality of the generated visual plans PI and action plans Pu, i.e., a visual (action) plan is considered successful if all the intermediate states (actions) are correct. This evaluation is done by a human for a given fold on the very first generated visual action plan.
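The four quantities (i)-(iv) can be aggregated from per-run records as sketched below. The record format is illustrative (not the thesis's evaluation code): each run stores a boolean per folding step plus the human judgments on the first generated visual and action plans.

```python
def system_metrics(runs):
    """Aggregate the evaluation metrics (i)-(iv) from folding runs.

    `runs` is a list of dicts (illustrative format), e.g.
    {"transitions": [True, True, False], "plan_ok": True, "actions_ok": True}.
    A run succeeds (i) iff every one of its folding steps succeeds (ii).
    """
    n = len(runs)
    system = sum(all(r["transitions"]) for r in runs) / n       # (i)
    steps = [ok for r in runs for ok in r["transitions"]]
    transition = sum(steps) / len(steps)                        # (ii)
    visual = sum(r["plan_ok"] for r in runs) / n                # (iii) plans P_I
    action = sum(r["actions_ok"] for r in runs) / n             # (iv) plans P_u
    return system, transition, visual, action
```

Note that a single failed step lowers the transition rate only slightly but makes the whole run count as unsuccessful, which is why the system rate is always at most the transition rate.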
Concerning the comparison with [153], five types of folds are carried out and each fold is repeated five times using the framework S-PROP, consisting of VAE16-f-L1, LSR-L1 and APN16-f-L1, which is compared with the frameworks S-L1, S-L2 and S-L∞ obtained from [153]. The results are shown in Table 5.6, while all execution videos, including the respective visual action plans, are available on the website2. It can be noticed that S-PROP outperforms the systems from [153] with a notable 96% system performance, only missing a single folding step, which results in a transition performance of 99%. As with S-L1 [153], S-PROP also achieves optimal
Method        Syst.   Trans.   PI     Pu

Fold 1 to 5 - comparison to [153]
S-PROP        96%     99%      100%   100%
S-L1 [153]    80%     90%      100%   100%
S-L2 [153]    40%     77%      60%    60%
S-L∞ [153]    24%     44%      56%    36%

Fold layer
S-PROP        50%     83%      100%   100%
Table 5.6: Results (best in bold) for executing visual action plans on the 5 folding tasks (each repeated 5 times) shown at the top. The bottom row shows the results on the fold that requires picking the top layer of the garment (repeated 10 times).
performance when scoring the initial visual plans PI as well as the initial action plans Pu.
Concerning the additional fold, it is repeated 10 times and the final results are reported in Table 5.6 (bottom row). It can be observed that the system has no trouble planning the folding but fails to pick up the top layer of the T-shirt in half of the cases during the plan execution. This is due to the imprecision of the Baxter and the difficulty of picking up layered clothing. The generated action plan, however, correctly identifies the layer fold as a fold where the top layer has to be picked. Therefore, methods specialized in layered cloth picking could be integrated into the proposed system.
Finally, the APN performance is also evaluated through the mean squared error (MSE) between the predicted and ground truth action specifics for picking and releasing, as well as the total model error. In detail, it achieves 82.6 ± 22.9, 29.3 ± 2.2, 270.6 ± 158.2, 71.8 ± 15.0, 0.0 ± 0.0, and 454.3 ± 153.8 for the picking and release coordinates pc, pr, rc, rr, the height h, and the overall error, respectively.
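The per-component errors and the overall error can be computed as below (note that 82.6 + 29.3 + 270.6 + 71.8 + 0.0 = 454.3, i.e., the overall error is the sum of the per-component MSEs). The function and column order are illustrative, mirroring the reporting order (pc, pr, rc, rr, h) used above.

```python
import numpy as np


def apn_errors(pred, true):
    """Per-component MSE of predicted vs ground-truth action specifics,
    plus the overall error as their sum.

    Columns (illustrative order, matching the text): p_c, p_r, r_c, r_r, h.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    per_component = ((pred - true) ** 2).mean(axis=0)  # MSE per coordinate
    total = per_component.sum()                        # overall model error
    return per_component, total
```

For a single prediction that is off by 10 pixels in one coordinate, the corresponding component contributes an MSE of 100, which also equals the overall error.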
It can therefore be concluded that the proposed framework allows effective manipulation of deformable objects in a real setup, and that the improved MM, LSR and APM modules together contribute to a significantly better system than in [153].
Chapter 6

Conclusions
This thesis work aimed to investigate human multi-robot interaction from multiple perspectives. Indeed, coordinating multiple robots while letting them interact with human operators opens up several issues that need to be addressed for an effective collaboration. First, human safety must be ensured at all times in scenarios where humans and robots work side by side. This must be guaranteed regardless of the human dynamic behavior and the task.
Next, proper strategies must be designed which allow achieving synergy between humans and robots according to the desired interaction, whether assistance or shared control. Then, the robots' control strategy must ensure the achievement of the desired human interaction while complying with possible constraints of the robotic system. In this regard, distributed architectures are generally desirable since they confer higher flexibility and robustness to faults compared to centralized ones.
In light of the above, this thesis presented solutions that allow realizing human multi-robot interaction to different extents by combining several methodologies from control theory, robotics and machine learning. More specifically, the problem of ensuring a reliable multi-robot system for human operators was first addressed. To this aim, a distributed fault detection and isolation strategy was proposed which, on the basis of residual signals and dynamic thresholds, enables each robot to monitor the state of