Abstract
In the pursuit of industrial efficiency, human performance within manufacturing systems remains pivotal. Traditional time study methods, which rely on direct observation and manual video analysis, are increasingly inadequate in light of technological advances. This research explores the automation of time study by applying deep learning models to action segmentation and comparing the efficacy of several architectural strategies. A dataset of nine work activities, performed by four subjects on three product types, was collected from a real manufacturing assembly process. Our methodology is built on a two-step video processing framework and captures activities from two viewpoints: overhead and hand-focused. Through experiments with 27 distinct models that vary in viewpoint, feature extraction method, and segmentation architecture, we identified improvements in temporal segmentation quality as measured by the F1@IoU metric. Our findings highlight the limitations of basic Transformer models in action segmentation, which we attribute to their lack of inductive bias and to the small scale of the dataset. Conversely, the 1D CNN and biLSTM architectures modeled temporal data effectively, which suggests that matching the architecture to the task matters more than model scale alone. The results contribute to the field by underscoring the interplay between model architecture, feature extraction method, and viewpoint integration in refining time study methodologies.
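For readers unfamiliar with the F1@IoU metric named above, the following is a minimal sketch of how a segmental F1 score at an IoU threshold is commonly computed in the action segmentation literature. It is not taken from this work: the function names `to_segments` and `f1_at_iou`, the default threshold of 0.5, and the matching rule (each ground-truth segment can be matched at most once) are assumptions made for illustration.

```python
def to_segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def f1_at_iou(pred_labels, true_labels, threshold=0.5):
    """Segmental F1: a predicted segment is a true positive when its IoU with an
    unmatched ground-truth segment of the same label reaches the threshold."""
    pred = to_segments(pred_labels)
    true = to_segments(true_labels)
    matched = [False] * len(true)
    tp = fp = 0
    for label, ps, pe in pred:
        best_iou, best_idx = 0.0, -1
        for j, (tl, ts, te) in enumerate(true):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, j
        if best_idx >= 0 and best_iou >= threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            # Wrong label, insufficient overlap, or a duplicate match (over-segmentation).
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this formulation, small boundary shifts are tolerated as long as the overlap stays above the threshold, while over-segmentation is penalized because extra fragments count as false positives.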