Abstract
With the growth in commercial aviation traffic and the need for improved environmental performance, strategies that lower emissions and can be implemented in the near term are necessary. Because novel technology takes time to enter the market, operational improvements that employ existing aircraft and require no new infrastructure are well suited to this goal. While quantitative data collected throughout aviation, such as arrival/departure statistics and flight data, have been used extensively, text data collected through safety reports have not been leveraged to their full extent. This paper presents a methodology that uses aviation text data to identify high-level causes of flight delays and cancellations, treating delays as a metric of operational inefficiency. The dataset is extracted from the Aviation Safety Reporting System (ASRS), which contains voluntary safety incident reports in the form of text narratives and metadata. The methodology uses natural language processing tools, K-means clustering, and dimensionality reduction by t-Distributed Stochastic Neighbor Embedding (t-SNE) to categorize and visualize the narratives. The method identified 7 major clusters and a total of 23 subclusters. A comparison between the subclusters' topics and the causes of flight delays revealed by the quantitative data shows that the ASRS database provides a unique safety perspective on delay cause identification, as illustrated by the method's identification of maintenance, rather than weather, as the main cause of delays.
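To make the categorization step concrete, the following is a minimal sketch of clustering text narratives with K-means, not the authors' implementation. It assumes TF-IDF weighting as the text representation (the abstract does not name the specific natural language processing tools), uses four hypothetical toy narratives rather than ASRS records, and fixes the initial centroids so the result is reproducible. The t-SNE visualization step is omitted. The names `tfidf_vectors`, `kmeans`, and `top_terms` are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical toy narratives for illustration only; these are not ASRS records.
NARRATIVES = [
    "maintenance found hydraulic leak flight delayed",
    "hydraulic leak repaired by maintenance before departure",
    "thunderstorm weather closed runway flight delayed",
    "weather thunderstorm caused ground stop at runway",
]


def tokenize(text):
    """Lowercase a narrative and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocab, vectors): one L2-normalized TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(len(docs) / doc_freq[t]) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, max_iter=50):
    """Lloyd's algorithm; centroids start at the documents in init_indices.

    Fixed initialization is an assumption made for reproducibility here;
    practical runs usually use random or k-means++ initialization.
    """
    centroids = [list(vectors[i]) for i in init_indices]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Return the n highest-weighted terms of a centroid (ties broken by name)."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:n]]


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors(NARRATIVES)
    labels, centroids = kmeans(vectors, init_indices=(0, 2))
    for c, centroid in enumerate(centroids):
        print(c, top_terms(vocab, centroid))
```

On this toy input, the two maintenance narratives fall into one cluster and the two weather narratives into the other. The highest-weighted centroid terms give a rough topic label for each cluster, which is the kind of summary that can then be compared against delay causes reported in quantitative data.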