Explore actionable strategies and best practices to safeguard your code against the emerging risks of AI-generated vulnerabilities.
The increasing capabilities of generative artificial intelligence (genAI) built on large language models (LLMs) have made it possible to create AI-generated code quickly. In pursuit of productivity and efficiency gains, many development teams have been asked to evaluate and adopt these tools, particularly those focused on AI-powered code generation. GitHub Copilot, Codeium, and Amazon CodeWhisperer offer appealing features that range from suggestions and code completion to writing more complete software components or even entire applications. Unfortunately, this ability to rapidly produce code also introduces new security challenges in the form of AI-generated vulnerabilities. Indeed, a recent Institute of Electrical and Electronics Engineers (IEEE) study of the security of GitHub Copilot’s generated code found that a significant share of its code suggestions (27.25%) were vulnerable.
Let’s look at the emerging risks associated with AI-generated code and how you can safeguard your software against these vulnerabilities.
AI-Generated Code: How It Introduces Risk
AI-generated code is created using machine learning (ML), LLMs, and natural language processing (NLP). Tools in this space are trained on large datasets of existing code (such as the questions, answers, and examples on Stack Overflow) to learn programming syntax, conventions, patterns, and even best practices. Some LLMs, such as OpenAI Codex or Google’s PaLM 2, use transformer architectures to process and generate humanlike text and code, combining pre-training on general datasets with fine-tuning on programming-specific data. These architectures may improve accuracy, but they may not deliver a high level of quality or security.
One of the biggest issues with AI-generated code is the potential for it to introduce security risks. These risks can manifest in a variety of ways:
- Embedded vulnerabilities: AI models are trained on huge datasets, including open source code repositories and other third-party sources, and may reproduce insecure patterns from that data, including reliance on outdated libraries or frameworks with well-known vulnerabilities.
- Insecure code generation: AI tools may generate predictable, repetitive code patterns that attackers recognize and know how to exploit.
- Lack of contextual awareness: AI models may lack information about an application’s security requirements or organizational best practices, producing code without proper input validation (see the sketch after this list).
- Lack of visibility: The AI decision-making process for code generation is frequently a black box; even the developers building the AI tools may not have complete visibility into how the AI determines what to produce. This makes AI-generated code difficult to predict, and it may inadvertently introduce bugs or vulnerabilities.
- Data poisoning attacks: A malicious actor could manipulate the training data for AI models by injecting malicious samples, or could create a backdoor by exploiting the configuration files a model relies on.
- Rapid deployment: Unlike human-written code, AI-generated code can be produced far more quickly than security teams can review it.
- Lack of oversight: Unless you enforce security validation of AI-generated code, vulnerable code may be pushed live, putting your applications at risk.
- Feedback loops: Newer AI models may be trained on insecure AI-generated code, creating a feedback loop that allows vulnerabilities to persist and even spread over time.
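To make the lack-of-contextual-awareness risk concrete, here is a minimal, hypothetical sketch in Python. The first function mirrors the kind of suggestion an assistant might produce when asked only to “look up a user,” while the second adds the parameterized query the application actually needs; the function and table names are illustrative, not taken from any specific tool.

```python
import sqlite3

# Hypothetical AI-style suggestion: builds the query by string
# interpolation, so a crafted username can change the SQL itself.
def find_user_insecure(conn: sqlite3.Connection, username: str):
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()  # vulnerable to SQL injection

# Safer version: a parameterized query keeps user input as data only,
# which is the contextual requirement the generated code omitted.
def find_user_safe(conn: sqlite3.Connection, username: str):
    query = "SELECT id, email FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()
```

Both functions compile and run; the difference is invisible unless a reviewer or an automated tool knows to look for it, which is exactly why this class of risk slips through.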
Potential Vulnerabilities Created
AI-generated code has the potential to introduce a range of vulnerabilities that compromise the security, functionality, and integrity of your software applications. Here are a few of the most common vulnerabilities and risks:
Common Security Vulnerabilities
- SQL injection: Failing to sanitize user inputs properly, leaving applications vulnerable to database exploitation
- Cross-site scripting (XSS): Mishandling input validation, allowing attackers to inject malicious scripts into web applications
- Command injection: Handling system commands improperly, allowing attackers to execute arbitrary commands on the server
- Missing input validation: Generating APIs and other components without proper input validation, increasing the risk of injection attacks and data breaches (see the sketch after this list)
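As a hedged illustration of the command injection and missing-input-validation items above, the sketch below contrasts an unsafe pattern (passing unvalidated input to a shell) with a validated, shell-free alternative. The hostname pattern is an assumption about what “valid input” means for this hypothetical ping utility.

```python
import re
import subprocess

# Unsafe pattern: user input is concatenated into a shell command,
# so an input like "8.8.8.8; rm -rf /" runs an arbitrary second command.
def ping_insecure(host: str) -> int:
    return subprocess.run(f"ping -c 1 {host}", shell=True).returncode

# Safer pattern: validate the input against an allowlist pattern and
# pass arguments as a list so no shell ever interprets them.
_HOSTNAME_RE = re.compile(r"^[A-Za-z0-9.-]{1,253}$")

def ping_safe(host: str) -> int:
    if not _HOSTNAME_RE.fullmatch(host):
        raise ValueError("invalid hostname")
    return subprocess.run(["ping", "-c", "1", host]).returncode
```

Passing arguments as a list avoids shell interpretation entirely, and the allowlist check rejects anything that is not a plausible hostname before the command runs.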
Business Logic Vulnerabilities
- Authorization bypasses: AI-generated code may omit important authorization checks, allowing attackers to manipulate business processes or gain unauthorized access (see the sketch after this list)
- Logic errors: AI-generated code generally lacks contextual understanding of specific business requirements, leading to vulnerabilities that are difficult to detect but enable malicious actors to exploit application-specific logic
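To make the authorization-bypass risk concrete, here is a minimal, hypothetical sketch; the Invoice data model, the in-memory store, and the function names are assumptions, not from any real application. The first lookup returns any invoice to any caller, while the second enforces the ownership check that an AI assistant has no way of knowing is required.

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    id: int
    owner_id: int
    amount: float

# Hypothetical in-memory store standing in for a database.
INVOICES = {1: Invoice(id=1, owner_id=42, amount=99.0)}

# Missing authorization: any authenticated user can read any invoice
# simply by guessing its ID (an insecure direct object reference).
def get_invoice_insecure(invoice_id: int) -> Invoice:
    return INVOICES[invoice_id]

# Business-logic-aware version: the caller must own the invoice.
def get_invoice_safe(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    if invoice.owner_id != current_user_id:
        raise PermissionError("not authorized to view this invoice")
    return invoice
```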
Supply Chain Vulnerabilities
- Outdated or vulnerable dependencies: AI tools may reference libraries that contain known vulnerabilities and propagate them into your applications
- Rules file backdoors: Attackers may be able to manipulate configuration files used by AI models to inject malicious instructions, thus creating persistent backdoors that affect future code generation
Data Exposure Risks
- Hardcoded secrets: AI tools may embed API keys, credentials, or other sensitive data directly into the code (see the sketch after this list)
- Exposed APIs: AI tools may generate poorly secured APIs that lack authentication or authorization mechanisms
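For the hardcoded-secrets risk, a minimal sketch shows the contrast; the PAYMENTS_API_KEY environment variable name is purely illustrative.

```python
import os

# Risky pattern sometimes seen in generated code: the credential ships
# with the source and ends up in version control.
API_KEY_INSECURE = "sk-live-1234567890abcdef"  # hardcoded secret

# Safer pattern: read the secret from the environment (or a secrets
# manager) at runtime and fail loudly if it is missing.
def load_api_key() -> str:
    api_key = os.environ.get("PAYMENTS_API_KEY")  # name is illustrative
    if not api_key:
        raise RuntimeError("PAYMENTS_API_KEY is not set")
    return api_key
```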
Intellectual Property and Privacy Risks
- Copyright infringement: Models trained on publicly available repositories can assemble code that is very similar to existing software without adhering to licensing terms, exposing your organization to licensing and IP risk
- Data leakage: Sensitive information included in training data (such as proprietary codebases) may be reproduced in AI-generated outputs
How to Safeguard Against AI-Generated Vulnerabilities
Given the ease and speed with which teams can generate application code using AI, AI-generated code is not going away. Your AppSec team must therefore be prepared to prevent AI-generated vulnerabilities from compromising your applications. To stay ahead of increasingly aggressive software release cycles, AppSec teams need more than traditional security tools and approaches.
Traditional approaches recommend a thorough review of AI-generated code by experienced developers who understand the security implications of the code and the broader business logic, but the reality is that, without automated tools, checking for known vulnerabilities and verifying that the code adheres to security best practices simply isn’t possible. Here are five ways you and your team can quickly and effectively safeguard your code:
- Run static application security testing (SAST) and software composition analysis (SCA) tools to identify potential vulnerabilities in AI-generated code (see the sketch after this list).
- Adopt secure-by-design approaches to incorporate security considerations from the beginning of the application development process. Adopting secure coding guidelines and training developers in secure coding practices can reduce the likelihood of vulnerabilities being deployed to production environments.
- Monitor for vulnerabilities continuously and apply updates and patches quickly when vulnerabilities are identified, for both AI-generated and human-generated code.
- Use AI to identify and prioritize vulnerabilities in code in order to accelerate the response to and remediation of security threats.
- Use an AI Bill of Materials (AI-BOM) to ensure you have a comprehensive inventory of all the hardware, software, data, and pipeline components in AI systems. Similar to a Software Bill of Materials (SBOM), the AI-BOM is tailored to AI systems to provide transparency, traceability, and security insights into the AI supply chain.
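As one possible way to wire the first recommendation into a pipeline, the sketch below shells out to two open source scanners, Bandit (SAST for Python code) and pip-audit (SCA for Python dependencies), and fails the build if either reports findings. The choice of tools and the `src` directory are assumptions; substitute whatever scanners and paths your team has standardized on.

```python
import subprocess
import sys

# Minimal CI gate: run a SAST scan and a dependency audit, and fail the
# build if either tool reports issues (both exit non-zero in that case).
def run_security_gate(source_dir: str = "src") -> int:
    checks = [
        ["bandit", "-r", source_dir],  # SAST scan of Python sources
        ["pip-audit"],                 # SCA scan of installed packages
    ]
    exit_code = 0
    for command in checks:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"security check failed: {' '.join(command)}")
            exit_code = 1
    return exit_code

if __name__ == "__main__":
    sys.exit(run_security_gate())
```

Running a gate like this in CI or as a pre-commit hook places an automated check between rapid AI-assisted generation and deployment, which is precisely the gap the rapid-deployment and lack-of-oversight risks describe.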