Resumen
Software developers represent the bastion of application security against the overwhelming cyber-attacks which target all organizations and affect their resilience. As security weaknesses which may be introduced during the process of code writing are complex and matching different and variate skills, most applications are launched intrinsically vulnerable. We have advanced our research for a security scanner able to use automated learning techniques based on machine learning algorithms to recognize patterns of security weaknesses in source code. To make the scanner independent on the programming language, the source code is converted to a vectorial representation using natural language processing methods, which are able to retain semantical traits of the original code and at the same time to reduce the dependency on the lexical structure of the program. The security flaws detection performance is in the ranges accepted by software security professionals (recall > 0.94) even when vulnerable samples are very low represented in the dataset (e.g., less than 4% vulnerable code for a specific CWE in the dataset). No significant change or adaptation is needed to change the source code language under scrutiny. We apply this approach on detecting Common Weaknesses Enumeration (CWE) vulnerabilities in datasets provided by NIST (Test suites?NIST Software Assurance Reference Dataset).