How to harness Artificial Intelligence for code translation
Software systems keep evolving as new programming languages gradually spread, such as Swift from Apple, Rust from Mozilla, Go from Google, TypeScript from Microsoft and many more. Even though no programming language is perfect for every problem, there is probably one that is well suited to the specific problem at hand. Unfortunately, adopting the best-suited language depends on several factors and is not always feasible. Moreover, considering that billions of lines of code have already been written, wouldn’t it be nice to have a way to pass seamlessly from one programming language to another? This is exactly what a source-to-source compiler does.
Core Reply, the Reply company specialized in Core System innovation for financial institutions, has already begun to tackle the challenge of code transformation in the context of Legacy Platform modernization. Many banks and financial institutions rely on legacy platforms powered by billions of lines of COBOL code. However, COBOL is expensive to maintain and is a major source of technical debt. One of the approaches adopted by Core Reply to solve these problems is automatic code transformation, which reduces the direct and indirect costs of an outdated technological infrastructure. This is where effective, automatic source-to-source compilers come into play.
A typical compiler translates source code from a high-level programming language to a lower-level one, such as machine code. A transcompiler (also called a transpiler or source-to-source compiler) is instead a translator that converts between programming languages operating at a similar level of abstraction.
These days, the market requires companies to evolve as fast as possible, but it is not rare for this process to be held back by existing legacy solutions in which changes are risky and time-consuming. Banks and traditional financial institutions are a good example of this kind of situation: they usually have a huge code base written in COBOL, poorly documented but extremely important for their business. In these cases, rewriting the code from scratch can be an insurmountable task, while manually translating it can be too expensive and can take years. A transpiler can pay down the technical debt by automatically producing a new code base, functionally equivalent to the old one and ready to be deployed in an open environment.
In traditional approaches, transcompilation is done by building syntax trees from the source code and applying handwritten rules to convert them from the source language to the target one. This approach has two main drawbacks:
• It requires a high level of understanding and expert knowledge of both the source programming language and the target one;
• Translating from dynamically-typed languages, like Python, to statically-typed ones, like Java, requires inferring the types of the variables, which is not only difficult but often impossible.
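The traditional approach described above can be sketched in a few lines. The following is a minimal illustration, not a production transpiler: it uses Python's standard ast module to build the syntax tree of a tiny Python function and applies handwritten rules to emit a Java method. The rule tables JAVA_TYPES and JAVA_OPS and all function names are assumptions made for this example, and only a very small subset of Python is covered.

```python
import ast

# Handwritten rules (illustrative only): Python annotation name -> Java type.
JAVA_TYPES = {"int": "int", "float": "double", "bool": "boolean", "str": "String"}
# Handwritten rules (illustrative only): Python operator node -> Java operator.
JAVA_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def java_type(annotation):
    """Map a Python annotation node to a Java type, or fail if it is unknown."""
    if not isinstance(annotation, ast.Name) or annotation.id not in JAVA_TYPES:
        raise ValueError("cannot determine a static type without an annotation")
    return JAVA_TYPES[annotation.id]


def expr_to_java(node):
    """Translate a small subset of Python expressions, one rule per node type."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in JAVA_OPS:
        left = expr_to_java(node.left)
        right = expr_to_java(node.right)
        return f"({left} {JAVA_OPS[type(node.op)]} {right})"
    raise NotImplementedError(f"no rule for {type(node).__name__}")


def function_to_java(source):
    """Translate a single-return Python function into a Java static method."""
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise NotImplementedError("expected a function definition")
    if len(func.body) != 1 or not isinstance(func.body[0], ast.Return):
        raise NotImplementedError("only single-return functions are supported")
    params = ", ".join(f"{java_type(a.annotation)} {a.arg}" for a in func.args.args)
    body = expr_to_java(func.body[0].value)
    return f"static {java_type(func.returns)} {func.name}({params}) {{ return {body}; }}"
```

For instance, function_to_java("def area(w: int, h: int) -> int:\n    return w * h") yields static int area(int w, int h) { return (w * h); }. The same function written without annotations raises an error, because no handwritten rule can decide which Java types to emit. Every additional language construct needs its own rule, which is where the expert knowledge and the development time go.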
The process itself resembles translation from one spoken language to another. In both scenarios, the “translator” needs knowledge of both languages. AI has helped translation between spoken languages reach state-of-the-art results. The two key concepts in this field are Machine Translation, the task of automatically converting source text in one language to text in another, and Neural Machine Translation, or NMT, which uses neural networks to learn the statistical model applied to machine translation.
In this scenario we can use TransCoder, the AI-based project developed by Facebook AI Research, which harnesses the potential of AI, and in particular the knowledge coming from Neural Machine Translation, and applies it to the field of transcompilation.
Applying AI to the source code translation problem can potentially overcome the downsides of traditional transcompilation, namely the expert knowledge needed and the time required to build such a system. Considering that the task is not so dissimilar from text translation, it is reasonable to expect that the state-of-the-art results reached for natural language can carry over to source code translation. The main obstacle to pursuing this approach is the lack of parallel data, that is, the same code written in both the source and the target language, which is needed to train traditional NMT models.
TransCoder’s innovation consists in borrowing a few principles from state-of-the-art NMT techniques and applying them to source code translation, while adopting an unsupervised training approach, which needs only monolingual source code, to overcome the lack of parallel training data.
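One of the NMT principles described in the TransCoder paper is denoising auto-encoding: the model receives a deliberately corrupted piece of code and is trained to reconstruct the original, so training pairs can be produced from code in a single language without any parallel data. The sketch below only illustrates how such pairs could be built. The whitespace tokenization, the MASK placeholder, the probabilities and the function names are all assumptions made for this example and are not taken from the TransCoder implementation.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token, chosen for this example


def add_noise(tokens, mask_prob=0.15, drop_prob=0.1, seed=None):
    """Return a corrupted copy of `tokens` by randomly dropping or masking tokens."""
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        roll = rng.random()  # uniform value in [0, 1)
        if roll < drop_prob:
            continue  # drop the token entirely
        if roll < drop_prob + mask_prob:
            noisy.append(MASK)  # hide the token behind the mask
        else:
            noisy.append(token)  # keep the token unchanged
    return noisy


def denoising_pair(source_line, seed=None):
    """Build one (corrupted input, original target) training pair from a line of code."""
    tokens = source_line.split()  # naive whitespace tokenization (assumption)
    return add_noise(tokens, seed=seed), tokens
```

A call such as denoising_pair("int total = a + b ;", seed=0) returns the corrupted token list together with the untouched one. A model trained on many such pairs has to learn the syntax of the language in order to fill the gaps.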
To evaluate the effectiveness of the model, the Facebook researchers proposed a new metric. This metric, called computational accuracy, checks whether the translated code, once executed, produces the same output as the original source code when both are fed the same input. This differs from traditional accuracy measures for these tasks, which typically compare the text of the generated source code with a reference. With traditional text translation metrics, an output with very small syntactic discrepancies from the reference is considered a highly valuable result, even though, for programming languages, source code with small differences can produce very different results once executed.
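The difference between the two kinds of metric can be illustrated with a small sketch. It is a simplified stand-in rather than the evaluation code used by the researchers: both the original and the "translated" functions are plain Python here, text similarity is approximated with the standard difflib module, and the names computational_accuracy, text_similarity and load are assumptions made for this example.

```python
import difflib


def text_similarity(a, b):
    """Character-level similarity between two source strings, between 0 and 1."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load(source, name):
    """Execute a source string and return the function called `name`."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def run(func, args):
    """Run `func` and capture either its result or the type of error raised."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def computational_accuracy(pairs, test_inputs):
    """Fraction of (original, translation) pairs that agree on every test input."""
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = 0
    for original, translation in pairs:
        if all(run(original, args) == run(translation, args) for args in test_inputs):
            correct += 1
    return correct / len(pairs)


# Three versions of the same small function, used as sample data.
original_src = "def diff(a, b):\n    return a - b\n"
near_copy_src = "def diff(a, b):\n    return b - a\n"   # almost identical text, wrong result
rewrite_src = "def diff(a, b):\n    return -(b - a)\n"  # different text, same result

original = load(original_src, "diff")
near_copy = load(near_copy_src, "diff")
rewrite = load(rewrite_src, "diff")
inputs = [(5, 3), (2, 2), (0, 7)]
```

With these samples, text_similarity(original_src, near_copy_src) is high because only two characters are swapped, yet computational_accuracy([(original, near_copy)], inputs) is 0.0. The rewritten version looks less like the original but reaches 1.0. The result depends on the chosen inputs: with (2, 2) alone, the wrong version would pass, so the test inputs need to exercise the code properly.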
TransCoder proved to be able to outperform state-of-the-art techniques and commercial solutions when translating from C++ to Java and from Java to Python.
Reply has been working on this new branch of AI-powered transpiler tools, in line with its aim of always providing the best market solution to companies that need to translate their code base into another programming language, one that can be more modern, faster and, potentially, cheaper than the old one.
Thanks also to Core Reply's experience in using transcompilation techniques as part of legacy platform modernization projects, the Reply team is verifying the validity of the approach proposed by Facebook AI Research by testing it on a number of real-world code snippets written in different programming languages.
Once satisfactory results have been achieved, the Reply experimentation will continue with the enrichment and training of the TransCoder model in order to tackle the translation of COBOL into another language, with a computational accuracy comparable to the one achieved when translating from C++ to Java and from Java to Python.
This R&D activity is part of a wider stream, in which Reply is working alongside its customers to optimize the process of creating digital products, through automation and machine learning solutions applied throughout the software development process, from requirements to go-live and subsequent evolution.