CodeImprove: Program Adaptation for Deep Code Models

Abstract

Leveraging deep learning (DL)-based code analysis tools to solve software engineering tasks is becoming increasingly popular. However, these tools face challenges related to performance limitations, scalability issues, and the potential for erroneous results (i.e., mispredictions). Retraining is often required to address these issues, but frequent model updates are costly in labeling and deployment. In this paper, we explore an alternative solution: Adapting the program inputs to the code models. This can be achieved by two steps: 1) input validation that focuses on identifying whether an input is an out-of-scope input program that are beyond a model’s handling capability, and 2) input adaptation that adapts out-of-scope inputs to become in-scope inputs. Validating program input is challenging, as current techniques focus on continuous inputs such as image data and fail with discrete inputs like code data, which have unique characteristics and are processed differently by deep learning models. Adapting out-of-scope programs is also challenging due to their vast search spaces. Therefore, in this paper, we propose CodeImprove, which distinguishes out-of-scope from normal inputs and converts such out-of-scope inputs back to in-scope inputs through program transformation. In particular, we propose a validity score metric to identify out-of-scope inputs and leverage genetics algorithms to apply semantic preserving program transformation to convert out-of-scope inputs to in-scope inputs. Our experimental results show CodeImprove can enhance upto 8.78% of accuracy, and 51.28% of relative improvements in three code models on two SE tasks. Additionally, our input validation is promising in detecting out-of-scope inputs (AUC score of 0.924).

Overview

Overview
Figure 1. Overview of CodeImprove




Figure 1. shows the overview of CodeImprove. CodeImprove consists of two main technical phases: Input Validation phase to detect out-of-scope inputs from normal inputs, and Input Adaptation phase to transform out-of-scope inputs to in-scope inputs.

Problem Definition

Given a code model M and an input code snippet x, the class with the highest probability is the final prediction result of M for x, denoted as M(x). During deployment, it is challenging to ensure that every prediction is correct. Thus, our goal is to improve model performance on test inputs via code adaptation.

Let y be the ground truth label of x and y' = M(x') be the prediction result after adapting x to x' through a series of transformations.

The objective is to compute a validation metric V and a set of transformations Tn such that the loss function of the adapted prediction L(y') is smaller than the original loss L(y).

We aim to find V and T under the following constraints:

The loss L is characterized by the distance from the ground truth. For example, a simple loss function could be:

The challenge is to develop an effective validation metric V and a sequence of transformations Tn that adapt mispredicted inputs, thereby improving the model's performance without necessitating retraining.

Transformation Rules

    CodeImprove currently employs 15 transformation operators. The list of operators are as follows:

  1. Function and Variable Name Renaming
  2. We will extract all identifiers from a given input by parsing the code and capturing the identifiers by filtering out keywords, macros, and special identifiers (e.g., main, stdio, etc.). This step ensures that the identifiers are only variable/function names.
    Next, the process of renaming identifiers involves two steps: identifying and subsituting the most important identifier.
    Our approach is focused on obtaining the logit outputs by the target model for supervision. We introduce a metric called the importance score. To compute the importance score we retrieved the tokens of both original code C = [t0, · · ·, ti, · · ·] and code after replacing the identifier with [MASK] at the ith token Ci = [t0, · ·, ti-1, [MASK], ti+1, · · ·]. The logit output of C and Ci denoted as Oy(C) and Oy(Ci) respectively. The importance score is denoted as:

    Ii = Oy(C) − Oy(Ci)

    Once the importance scores are computed, we select the top K identifiers. For each identifier in K we replace the original identifier by generating candidates using the RoBERTa masked language model. Then we selected the identifier with highest logit output value to replace the original code. Maintaining K identifiers ensure to limit the large search space because the search space is limited to K identifiers.

    
                                int i; to int t; 
                            

  3. for-loop to while-loop.
                
                For-loop:
                    for (initialization; condition; increment) {
                    // Loop body
                    // Statements to execute
                    }
    
                While-loop Transformation:
                    initialization
                    while (condition) {
                        // Loop body
                        // Statements to execute
                        increment;
                    }
                  
                  

  4. while-loop to for-loop.

                   
                While-loop:
                    while (condition) {
                            // Loop body
                            // Statements to execute
                            increment;
                    }
    
                For-loop Transformation:
                    for(initialization; condition; increment) {
                        // Loop body
                        // Statements to execute
                    }
                    
                    
  5. do-loop to while-loop.

    During do-while-loop transformation into a while-loop, the body statement inside the do-while-loop will be executed once before entering the transformed while-loop.

                  
                   Do-while-loop:
                        do {
                            // Loop body
                            // Statements to execute
                            increment;
                        } while (condition);
    
                    While-loop Transformation:
                            // Execute the loop body
                            // Statements to execute
                            while (condition) {
                            // Loop body
                            // Statements to execute
                            increment;
                        }
                        
                        
  6. if elseif to if else
                  
                   if else-if :
                        if(condition A){ 
                              //body A
                        }else-if (condition B){
                              //body B
                        } else {
                              //body C
                        }
    
                    if-else transformation:
                        if(condition A){ 
                              //body A
                        } else {
                             if (condition B) {
                                //body B
                             }
                             else {
                                //body C
                              }
                         } 
                        
                        
  7. if else to if else-if
                  
    
                      if-else:
                        if(condition A){ 
                              //body A
                        } else {
                             if (condition B) {
                                //body B
                             }
                             else {
                                //body C
                              }
                         } 
    
                        if else-if transformation:
                          if(condition A){ 
                              //body A
                            }else-if (condition B){
                              //body B
                            } else {
                              //body C
                            }
                  
                  
  8. Switch statements to if elseif
    
                    
                  
                      switch statement:
                          switch (a) {
                            case A: body A
                            case B: body B
                            default: body C
                         }
    
                      if else-if transformation: 
                          if (a ==A) body A
                          else if (a==B) body B
                          else body C 
                 
                  
                   
  9. Relation expression transformation.
    Transform relational expressions such as
     a < b to b >a 
  10. Modification to Unary operations.
    Modify unary operations such as
     i++; to i=i+1;
  11. Modifications to incremental operations.
    Modifies incremental operators such as
     i+=1; to i=i+1;
  12. Modifying constants
    Modify the constant values in expressions such as
     i = 0; to i = 10-10;
  13. Modifications to variable definitions.
    Modifies the definitions of variables.
     
                  int b = 0; to int b; b=0;
                  
  14. Add junk code.

    Adds code that will never be executed

    
                          before adding the junk code: 
                              if(a){
                                  \\body A
                              }
    
                          after adding the junk code:
                              if(a){
                                  \\body A
                                  if (0) return 0;
                              }
    
    
                      
  15. Change order of statements in a block.

    Only reorder the statements without any data- or control-dependency.

    
                        before reorder: 
                          a = b+10;
                          c = d+10;
    
                        after reorder:
                          c = d+10;
                          a = b+10;
    
                          
                      
  16. Deleting statements that print debugging hints and comments.

    Only delete comments or statements that print debugging hints or intermediate results.

    
                            printf("trial");
                            //comments;
    
    
                             to 
    
                              printf("trial); 
                               //comments;  
                        
    
                         
                     

Experiment Results

We show that CodeImprove's input validation outperforms exisiting techniques. We compared our technique with CLD, another approach used by leveraging the proportion to the best prediction with the next top two predictions. Also, we tried CLD with CodeImprove's equations 1-3 (shown in the paper) generating sub-models without dropout. Based on Table 1, we can conclude that CodeImprove still outperforms the other techniques in terms of AUC score.

Table 1: AUC Comparison of Different Input Validation Techniques
Experiment Vulnerability Detection Defect Prediction
CodeBERT RoBERTa GraphCodeBERT CodeBERT RoBERTa GraphCodeBERT
CLD 0.850 0.819 0.757 0.889 0.828 0.873
CodeImprove-no-dropout 0.850 0.819 0.757 0.891 0.903 0.898
CodeImprove 0.876 0.825 0.781 0.911 0.924 0.909

More Examples on Adapted Inputs by CodeImprove

The objective of this section is to examine whether the adapted programs maintain the semantics of the original inputs. We investigate the effectiveness of applying semantic preserving program transformations to adapt out-of-scope inputs. Based on our investigation we provide examples below

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 1: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the transformation of the given code in Fig. 1, several key changes were made to improve the model's understanding of the code data. Originally (lines 10-12), variables were declared and assigned together, and the loop increment was written as i++. The increment operation within the loop was simplified to count++. After transformation (lines 10-18), variables were declared separately (int count; int i; int j;) and assigned later (cout = 0;), the loop increment was explicitly written as i = i + 1, and the increment operation was clarified to count = count + 1. These changes made the code logic more explicit, helping the model accurately predict the “wrong output”.

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 2: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the transformation of the provided code in Fig. 2, several changes were made to improve the model's prediction accuracy. Initially, in the original code (lines 6-34), the loops and condition checks were straightforward, with nested loops for scanning and processing the array. First, the while loop in line 6 was converted to a for loop. Then, the initialization and increment of for loops lines 22 and 24 has been seperated in the Fig. 2b. At last, After transformation (lines 7-42), some for loops (lines 25, 29, 30 in Fig.2a) were converted into while loops with explicit increment operations (k++, j++), and variables were initialized separately. These adjustments made the model to understand code logic differently, resulting in the model accurately predicting the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 3: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the code transformation in Fig. 3, key changes were made to improve the prediction accuracy by CodeImprove. Initially, the code used for loops (lines 8-9, 11-12) to iterate over the array for scanning values and finding the minimum. In the transformed version, these for loops were replaced with while loops (lines 8-11, 16-19), where the loop variables were explicitly incremented (i++) within the loop bodies. Additionally, the initialization of the loop variable i was moved outside the loops (lines 8, 16), and leading to the correct prediction of the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 4: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 4, several transformations were made to improve prediction accuracy. Initially (lines 3-14), the code declared the array a[100][100] and used compact for loops with combined initialization and iteration. Specifically, the loop on line 7 was used to reset t[i], and the loop on line 13 started with i = 1. After transformation (lines 3-15), the array declaration was updated with specific dimensions by modifying the constants, with expressions like i = (509 - 409) and i = (234 - 233). These changes lead to an accurate model predictions.

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 5: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 5, several transformation were made to improve prediction accuracy. Initially (lines 3-16), the while loop on line 12 iterated with t-- and the base case for gcd_euclid on lines 3-7 used simple conditions. In the transformed version (lines 3-20), the while loop was replaced with a for loop (for(; t--; )), and conditions within gcd_euclid were expanded with modified constant values, such as if (b == (910 - 910)) and if (a >= b && b > (404 - 404)). These changes transformation lead to the correct prediction of the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 6: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 6, several transformations were made to improve prediction accuracy. Initially (lines 2-14), the while loop on line 13 iterated with a-- and the base case for GCD on line 3 used a simple condition (if (b == 0)). In the transformed version (lines 2-15), the while loop was replaced with a for loop (for (; a--; )), and the condition within GCD was expanded with modified constant values (if (b == (948 - 948))). These changes lead to the correct prediction of a "timeout error".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 7: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig 7, several transformation were made to improve prediction accuracy. Initially (lines 7-10), the while loop iterated with t--, and the for loop used a standard initialization and increment pattern. In the transformed version (lines 6-17), the while loop was replaced with a for loop (for(; t--; )), and the initialization for the inner for loop was separated with an increment operation (i++;). These changes lead to the correct prediction of the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 8: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 8, several transformation were undertaken to improve prediction accuracy. Initially (lines 2-3), variables were declared together, and the while loop (line 5) iterated with t--. In the transformed version (lines 2-7), variable declarations were separated, and the while loop was replaced with a for loop (for(; t = t-1;)). These modifications helped the model correctly predict a "timeout error".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 9: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 9, several transformations were made to improve prediction accuracy. Initially (lines 9-20), nested for loops were used for array operations. The outer loop on line 18 was used to decrement i, and the inner loop on line 19 iterated over j with a complex condition. In the transformed version (lines 10-33), these for loops were converted to while loops with explicit increment and decrement operations (i++, j++). This helped the model correctly predict the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 10: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 10, only one transformation was made improve prediction accuracy. Initially (lines 5-24), the while loop iterated with t--. In the transformed version (lines 5-24), the while loop was modified to decrement t within the loop condition (t = t - 1). This change made the model correctly predict the "runtime error".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 11: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Figure 11, several transformations were made to improve prediction accuracy. Initially (lines 2-10, 19-25), the gcd function used a straightforward condition (if (a == 0)) and the lcm function checked for g == 1. In the transformed version (lines 2-10, 19-25), these conditions were replaced with more explicit expressions (if (!(0 != a)) and if (g == (913 - 912))). These changes made the model correctly predict "no defects".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 12: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 12, several transformations were made to improve prediction accuracy. Initially (lines 9-24), variables were declared with specific initial values, and nested for loops were used for array operations. The min variable was initialized to 999999, and the outer loop on line 17 iterated with c++. In the transformed version (lines 9-32), the initial values for variables were changed to expressions like (1000267 - 268) for min and (451 - 451) for c. The nested for loops were converted to while loops with increment operations (i = i + (410 - 409)). These changes model correctly predict the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 13: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code snippet in Fig. 14, several transformations were made to improve prediction accuracy. Initially (lines 5-18), the while loop iterated with t-- and declared multiple variables within the loop. The while loop on line 18 checked remainder != 0. In the transformed version (lines 5-30), the while loop was replaced with a for loop (for(; t--; )), and variable declarations were separated. The second while loop was also replaced with a for loop (for(; remainder != 0; )). These changes helped the model correctly predict the "wrong output".

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 14: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the code snippet provided in Fig. 14, the transformations focus on improving the structure and logic of the loops to correct the model's predictions. Specifically, the initial declaration of the array arr and the loop variables were changed from int i, j, k, n, rows; int arr[1005][1005]; to int arr[1005][1005]; int i; int j; int k; int n; int rows;. Additionally, the loop structures were changed from increment operators like i++ and j++ to expressions like i = i + 1 and j = j + 1.

Before Transformation

(a) Before Transformation

After Transformation

(b) After Transformation

Fig. 15: An example of a Mispredicted Input changed to a Correctly Predicted Input by CodeImrpove

In the provided code in Fig. 15, several transformation were undertaken by CodeImprove to improve prediction accuracy. Initially (lines 2-3, 6-8, 10-14), variables were declared together, loop initializations were concise, and the sum and pos variables were reset within nested loops. After transformation (lines 2-33), variable initializations were updated with modified constants, loops were adjusted (e.g., k = k + 1, i = i + 1), and constants were added to initialization. These changes lead to the correct prediction of a "timeout error".