A team of researchers from the Universities of Bristol, Michigan, and Toronto has introduced EPIC-KITCHENS VISOR, a new dataset of pixel-level annotations that segment hands and a wide variety of active objects in first-person (egocentric) videos.
Built with an AI-assisted annotation pipeline, VISOR labels the objects that appear in each video, including human hands, ingredients, cutlery, and other kitchen items. In total, the dataset contains 271K manual semantic masks covering 257 object classes and more than 10M interpolated dense masks, spanning 36 hours across 179 untrimmed videos.
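The jump from 271K manual masks to 10M dense masks comes from interpolating between sparsely annotated keyframes. As a loose illustration only (not the authors' actual pipeline), assuming masks are stored as polygons with matching vertex counts across keyframes, a minimal sketch of linear interpolation between two keyframe polygons might look like this; the function name and data layout are hypothetical:

```python
def interpolate_polygons(poly_a, poly_b, num_steps):
    """Linearly interpolate between two polygon masks with matching
    vertex counts, producing one polygon per in-between frame.

    poly_a, poly_b: lists of (x, y) vertex tuples of equal length.
    Returns num_steps polygons, excluding the two keyframe endpoints.
    """
    if len(poly_a) != len(poly_b):
        raise ValueError("polygons must have the same number of vertices")
    frames = []
    for step in range(1, num_steps + 1):
        t = step / (num_steps + 1)  # fraction of the way from poly_a to poly_b
        frames.append([
            (ax + t * (bx - ax), ay + t * (by - ay))
            for (ax, ay), (bx, by) in zip(poly_a, poly_b)
        ])
    return frames

# Example: a 10x10 square mask drifting 10 px to the right,
# with two interpolated frames between the annotated keyframes.
square_t0 = [(0, 0), (10, 0), (10, 10), (0, 10)]
square_t1 = [(10, 0), (20, 0), (20, 10), (10, 10)]
mid_frames = interpolate_polygons(square_t0, square_t1, num_steps=2)
```

In practice, dense-mask propagation in video typically relies on learned tracking or optical-flow models rather than pure vertex interpolation, but the sketch conveys how a few manual annotations can be expanded into orders of magnitude more dense masks.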