Environmentally Aware Prosthesis

To make prosthetics more responsive to their surroundings, we needed a new sensing modality that can perceive upcoming terrain before the user reaches it. We settled on adding a small camera to the prosthesis that views the ground in the immediate vicinity of the user’s foot. With this modality, as the user approaches stairs or a curb, the prosthesis adjusts automatically so that the subject does not trip and the robotic prosthesis takes the optimal control action. In the same vein, small but significant changes to the control action are needed to adapt the gait when the user walks over grass, sand, or gravel. Our method enables users to transition seamlessly between terrain types while retaining control over their actions. The overview image shows an example of our control policy and the predicted depth map rendered as point clouds. We also include examples of our depth predictions over different terrain types below.

Data Collection

In tandem with this research, we collected and released a custom dataset of 65 scenes with over 50,000 RGB-depth image pairs of a subject walking over various obstacles and surfaces, including sidewalks, roadways, curbs, gravel, grass, carpeting, and stairs (both up and down). Data collection occurred under differing lighting conditions at a fixed depth range (0.0-6.0 meters). Among the image pairs, 1,723 contain curbs, 5,816 contain uneven floors, and 6,201 were collected while stepping up or down stairs, so 27.48% of the dataset involves one of these obstacles. We also varied the camera position on the lower limb during collection. Depth values in our figures are depicted with a grey color map, where darker pixels are closer and lighter pixels are further away. The bottom row of the figure below shows the annotated masks used to train the Mask R-CNN model that predicts the foot area and improves temporal consistency.
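
As a concrete illustration of this visualization convention, the minimal Python sketch below maps a metric depth frame onto the grey color map described above (clipped to the fixed 0.0-6.0 m range, darker = closer). The file names and on-disk layout are hypothetical and are not part of the released dataset tooling.

```python
# Minimal sketch (not the released loader): render one depth frame with the
# grey color map used in our figures. Darker = closer, lighter = further away.
import numpy as np
from PIL import Image

DEPTH_MIN_M, DEPTH_MAX_M = 0.0, 6.0  # fixed capture range of the dataset

def depth_to_grey(depth_m: np.ndarray) -> Image.Image:
    """Map metric depth (H x W, in meters) to an 8-bit grey image."""
    d = np.clip(depth_m, DEPTH_MIN_M, DEPTH_MAX_M)
    # Normalize so near (0 m) -> dark (0) and far (6 m) -> light (255).
    grey = (d - DEPTH_MIN_M) / (DEPTH_MAX_M - DEPTH_MIN_M) * 255.0
    return Image.fromarray(grey.astype(np.uint8), mode="L")

if __name__ == "__main__":
    # Hypothetical file names for one RGB-depth pair.
    rgb = Image.open("scene_001/frame_0001_rgb.png")
    depth = np.load("scene_001/frame_0001_depth.npy")  # meters, H x W
    depth_to_grey(depth).save("frame_0001_depth_grey.png")
```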

Training

We utilized a deep convolutional neural network in an autoencoder configuration: the input is the raw RGB image and the output is a predicted depth map of that image. By conditioning the network on depth-based terrain features during training, we can extract latent terrain features from the raw image. An additional predictive model takes the bottleneck layer, which contains these latent terrain features, and determines the optimal control action for the current terrain condition. Our novel contributions in this work are: (1) an end-to-end training approach that outputs depth maps that are more accurate and less noisy than those from IR-based depth sensors, and (2) a control policy that yields appropriate control actions for various terrain types.
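
The sketch below illustrates this encoder-bottleneck-decoder layout in PyTorch, with a small policy head reading the bottleneck features. It is a sketch only: the ResNet-50 truncation, the decoder layer sizes, and the action dimensionality are assumptions made for illustration, not the exact architecture evaluated in this work.

```python
# Illustrative sketch, assuming a PyTorch implementation. Layer sizes, names,
# and the control-action dimensionality are assumptions, not the exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DepthAutoencoder(nn.Module):
    """RGB -> depth autoencoder whose bottleneck doubles as a terrain feature."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        # Encoder: ResNet-50 backbone truncated before pooling/classification.
        backbone = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # B x 2048 x h x w
        # Decoder: upsampling convolutional stack predicting one depth channel.
        self.decoder = nn.Sequential(
            nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        # Control head: pooled bottleneck (latent terrain) features -> control action.
        self.policy = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2048, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_actions),
        )

    def forward(self, rgb: torch.Tensor):
        latent = self.encoder(rgb)                       # bottleneck terrain features
        depth = self.decoder(latent)                     # coarse depth prediction
        depth = F.interpolate(depth, size=rgb.shape[-2:],
                              mode="bilinear", align_corners=False)
        action = self.policy(latent)                     # terrain-conditioned control action
        return depth, action

# Example: one forward pass on a batch of 90x160 RGB frames.
model = DepthAutoencoder()
depth, action = model(torch.randn(2, 3, 90, 160))
```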

We trained our model on 80% of the custom dataset (52 scenes) with the Adam optimizer and a learning rate of 10⁻⁵, withholding the remaining 20% (13 scenes) for testing and validation. The RGB input size is 90x160x3, and the output is a single-channel 90x160 predicted depth map. The ground truth is down-sampled to 3x5, 6x10, 12x20, 23x40, and 45x80 to match the network’s intermediate depth feature outputs. We trained and evaluated eight network architectures: ResNet-50 and EfficientNet encoders paired with decoders built from residual learning, DispNet, and their combinations with convolutional layers. All models were trained for 100 epochs and then fine-tuned for 30 epochs with a disparity loss at a learning rate of 10⁻⁷. We also compared our models against the MiDaS v3.0 depth prediction model and DPT-Large. Furthermore, we fine-tuned a pre-trained Mask R-CNN as the masking network using a binary cross-entropy loss with masks from the custom dataset.
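
The following sketch shows how the multi-scale depth supervision described above could be assembled in a PyTorch training loop. The L1 loss and equal weighting across scales are assumptions for illustration; only the scale sizes and optimizer settings come from the text.

```python
# Sketch of multi-scale depth supervision. The loss choice (L1) and equal
# scale weighting are assumptions; scale sizes match the text.
import torch
import torch.nn.functional as F

SCALES = [(3, 5), (6, 10), (12, 20), (23, 40), (45, 80)]

def multiscale_depth_loss(preds, depth_gt):
    """preds: list of predicted depth maps, one per scale in SCALES.
    depth_gt: ground-truth depth of shape (B, 1, 90, 160)."""
    loss = 0.0
    for pred, size in zip(preds, SCALES):
        # Down-sample the ground truth to the intermediate output resolution.
        gt = F.interpolate(depth_gt, size=size, mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(pred, gt)
    return loss

if __name__ == "__main__":
    # Dummy tensors standing in for network outputs and ground truth (0-6 m).
    gt = torch.rand(2, 1, 90, 160) * 6.0
    preds = [torch.rand(2, 1, *s) * 6.0 for s in SCALES]
    print(multiscale_depth_loss(preds, gt).item())
    # Optimizer settings from the text: Adam at 1e-5 for the 100-epoch run,
    # dropped to 1e-7 for the 30-epoch disparity fine-tuning.
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```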

The videos below show the predicted depth maps in grayscale, where darker pixels are closer and lighter pixels are further away. We show four examples of a subject walking over gravel outdoors, over carpet indoors, and up and down outdoor stairs. The RGB images on the left are used to directly estimate the depth maps on the right. The videos show that our method of monocular depth prediction is robust to ground type, lighting conditions, and gait.

Further Information

For further information, please consult our GitHub repository.