Are Current Inline DLP Solutions Falling Short for GenAI/LLM Transactions?
What is inline DLP?
Inline Data Loss Prevention (DLP) solutions are designed to detect and prevent the unauthorized transmission of sensitive information, such as Personally Identifiable Information (PII), Protected Health Information (PHI) and other enterprise-specific sensitive data. These solutions operate at the network level, providing comprehensive visibility and control over data flowing across the organization’s network to various endpoints, including the internet, employee devices, SaaS applications, and enterprise applications.
By integrating inline DLP capabilities, Secure Access Service Edge (SASE) and Security Service Edge (SSE) solutions deliver a robust, cloud-delivered network security service. These solutions utilize various types of proxies—forward, transparent, and reverse—to intercept and inspect network traffic, ensuring that sensitive information is adequately protected along with various other security functions.
How does inline DLP work?
Inline DLP solutions utilize multiple techniques to detect and stop sensitive information from leaking. Some of these techniques are listed below.
Content analysis: This technique involves examining the type of file or category of content to determine if it falls under sensitive information. It helps in identifying sensitive content by analyzing its structure and properties.
Pattern matching: This technique uses a set of predefined regular expressions to identify patterns associated with sensitive information, such as Social Security numbers or credit card details. By recognizing these patterns, inline DLP solutions can control the transmission of sensitive data. They also allow custom identifiers to be defined using regular expressions to detect enterprise-specific sensitive information.
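As a simplified illustration, the sketch below shows how regex-based pattern matching might flag SSN-like and credit-card-like strings in intercepted text. The patterns and the find_sensitive helper are hypothetical; production DLP engines ship far more extensive, validated identifier libraries.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_sensitive(text: str) -> dict:
    """Return the pattern matches found in a piece of intercepted text."""
    return {name: pattern.findall(text)
            for name, pattern in PATTERNS.items() if pattern.search(text)}

print(find_sensitive("SSN 123-45-6789, card 4111 1111 1111 1111"))
# {'ssn': ['123-45-6789'], 'credit_card': ['4111 1111 1111 1111']}
```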
Watermark detection: Enterprises can embed unique identifiers or watermarks within documents. Pattern matching techniques can then detect these watermarks, ensuring the protection of proprietary information.
Metadata inspection: Document metadata, such as author information, creation date, and classification labels, can be analyzed to identify sensitive data. This technique helps in mapping documents to sensitive categories based on their metadata attributes.
Fingerprint analysis: This technique involves creating digital fingerprints (such as hashes, checksums and other cryptographic fingerprints) of sensitive content, which are then used to detect and block the transfer of similar content.
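A minimal sketch of exact-match fingerprinting is shown below, assuming a pre-registered hash set of known sensitive documents (the registry and content here are hypothetical).

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Compute a SHA-256 fingerprint of a piece of content."""
    return hashlib.sha256(content).hexdigest()

# Hypothetical registry of fingerprints for known sensitive documents.
registered_fingerprints = {fingerprint(b"Q3 board deck - CONFIDENTIAL")}

def is_known_sensitive(payload: bytes) -> bool:
    """Flag the payload if it matches a registered fingerprint."""
    return fingerprint(payload) in registered_fingerprints

print(is_known_sensitive(b"Q3 board deck - CONFIDENTIAL"))  # True
print(is_known_sensitive(b"Q3 board deck - confidential"))  # False (exact match only)
```

Exact hashing detects only identical content; partial and rolling fingerprints are needed to catch excerpts or lightly edited copies.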
The above techniques are combined with contextual analysis to minimize false positives. Some of the contextual techniques include:
– User Context: Identifying the communicating user and their claims, such as group, role, etc.
– Service Context: Identifying the destination service.
– Network Context: Identifying networks being used to send/receive traffic.
– Device Posture Context: Determining the security posture of the endpoint from which the user is communicating.
By integrating context with detected sensitive information, inline DLP solutions significantly reduce false positives, thereby decreasing the number of alerts. For instance, a detection may not need to be flagged if a given user is expected to interact with that specific service; customer support personnel accessing customer-specific sensitive data, for example, is not regarded as a data leak.
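The sketch below illustrates the idea of combining detections with context before alerting; the roles, destinations, and the should_alert helper are hypothetical stand-ins for a real policy engine.

```python
# Hypothetical allow-list of (role, destination) pairs where sensitive data
# is expected to flow.
ALLOWED_COMBINATIONS = {("customer_support", "crm.example.com")}

def should_alert(detections: list, user_role: str, destination: str) -> bool:
    """Raise an alert only when sensitive data flows outside the expected context."""
    if not detections:
        return False
    return (user_role, destination) not in ALLOWED_COMBINATIONS

print(should_alert(["credit_card"], "customer_support", "crm.example.com"))  # False
print(should_alert(["credit_card"], "engineering", "chat.example.ai"))       # True
```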
Challenges in detecting leaks in LLM transactions
The primary security risk of GenAI/LLM lies in the potential exposure of sensitive data. Employees may inadvertently share confidential, proprietary, or private information with GenAI/LLM tools while focused on their tasks.
Real-world data leaks and copyright violations specific to AI/LLM:
– Exposure of confidential data, as in the widely reported Samsung ChatGPT leak: https://mashable.com/article/samsung-chatgpt-leak-details
– Reproduction of copyrighted content, which led The New York Times to sue OpenAI: https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
Detecting sensitive data exposure in LLM transactions (in both prompts and responses) presents unique challenges that traditional DLP tools are not equipped to handle. Let’s explore these unique characteristics of LLM transactions and the related challenges in the context of DLP use cases.
Further Encapsulations on HTTP Request & Response Data:
Traditionally, each HTTP transaction carries a single JSON- or XML-formatted request/response payload. In LLM transactions, however, the WebSocket mechanism is often used. A WebSocket message typically carries JSON-formatted response data; there can be multiple JSON messages within a single WebSocket message, and a single JSON message can also span multiple WebSocket messages.
Some of the current generation Data Loss Prevention (DLP) solutions do not interpret WebSocket messages in HTTP transactions. Instead, they bypass the traffic upon encountering WebSocket data. Consequently, these DLP solutions may fail to detect sensitive information in LLM transactions that employ WebSockets.
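The sketch below shows one way a DLP intermediary could reassemble JSON objects from WebSocket payloads before running detection. It is a simplified illustration that ignores WebSocket framing, masking, and binary payloads; the WebSocketJSONReassembler class is not taken from any particular product.

```python
import json

class WebSocketJSONReassembler:
    """Extract complete JSON objects from a stream of WebSocket payloads.

    Handles multiple JSON objects in one message as well as a single JSON
    object split across several messages.
    """

    def __init__(self):
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, payload: str) -> list:
        self._buffer = (self._buffer + payload).lstrip()
        objects = []
        while self._buffer:
            try:
                obj, end = self._decoder.raw_decode(self._buffer)
            except json.JSONDecodeError:
                break  # JSON is incomplete; wait for the next payload
            objects.append(obj)
            self._buffer = self._buffer[end:].lstrip()
        return objects

reassembler = WebSocketJSONReassembler()
print(reassembler.feed('{"content": "John Doe, SSN 123'))   # [] - incomplete JSON
print(reassembler.feed('-45-6789"}{"content": "done"}'))    # both objects decoded
```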
Streaming:
To enable real-time visual rendering to the end users, many LLM service providers send the LLM-generated data as and when it is known. Some LLM services use the Server-Sent Events mechanism to push incremental data to the client over a single HTTP connection. Others employ the WebSocket mechanism, which supports bidirectional streaming capability to send incremental data to clients. Additionally, some services utilize HTTP/2 and HTTP/3 protocols, which allow multiplexing multiple streams over a single connection, thereby ensuring more efficient and faster data transmission. Furthermore, technologies like gRPC, which leverage HTTP/2’s streaming capabilities, are increasingly being adopted for their ability to handle complex data interactions in real-time.
Some inline-DLP solutions bypass all security functions, including DLP, when processing HTTP/2 or HTTP/3 traffic. This can result in failure to detect sensitive information.
Incremental data in LLM transactions is transmitted not at the sentence or paragraph level, but rather at the token level. Consequently, inline DLP tools, which act as intermediaries, only see individual tokens. Detecting sensitive information by analyzing these tokens, which typically consist of 3 to 4 characters, is not feasible. As a result, many current inline DLP solutions fail to identify sensitive information effectively in such cases, where LLM streaming is enabled.
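The sketch below illustrates why buffering matters: it accumulates token-level deltas from a Server-Sent Events stream and scans the accumulated text rather than individual tokens. The "data: {...}" event format and the "delta" field are simplified assumptions, not any vendor's exact schema.

```python
import json
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class StreamingDetector:
    """Accumulate token-level SSE deltas and scan only the accumulated buffer."""

    def __init__(self):
        self.buffer = ""

    def feed_sse_line(self, line: str) -> list:
        if not line.startswith("data:") or line.strip() == "data: [DONE]":
            return []
        self.buffer += json.loads(line[len("data:"):]).get("delta", "")
        # Scanning each 3-4 character token in isolation would never match;
        # scanning the accumulated buffer does.
        return SSN_PATTERN.findall(self.buffer)

detector = StreamingDetector()
for event in ['data: {"delta": "The SSN is 123"}',
              'data: {"delta": "-45-"}',
              'data: {"delta": "6789."}']:
    matches = detector.feed_sse_line(event)
print(matches)  # ['123-45-6789']
```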
Data duplication in LLM transactions:
Some LLM services, instead of sending new tokens incrementally, resend the new token along with all previous tokens. This approach supports clients that can’t buffer data and simplifies implementation. However, this mechanism can result in inline DLPs encountering the same data multiple times.
For instance, if a credit card number is included in the data stream, it might appear repeatedly due to this duplication method. Inline DLP solutions often lack the capability to deduplicate the data before applying detection mechanisms. This means that sensitive information, such as a credit card number, could be flagged multiple times, leading to inefficiencies and potential oversight in data protection.
To illustrate, consider a scenario where an LLM sends the following tokens:
1. First message: “1234”
2. Second message: “1234 5678”
3. Third message: “1234 5678 9012”
Each subsequent message includes all previous tokens. An inline DLP solution without deduplication capabilities will process “1234” three times, “5678” twice, and “9012” once. This redundancy can hinder the detection process and compromise the efficiency of data security measures.
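A minimal deduplication sketch for such cumulative messages is shown below: it extracts and scans only the newly appended suffix of each message. The CumulativeDeduplicator class is hypothetical.

```python
class CumulativeDeduplicator:
    """Extract only the newly appended text from cumulative LLM messages."""

    def __init__(self):
        self._last_message = ""

    def new_text(self, message: str) -> str:
        # If the service resends everything seen so far, strip the known
        # prefix and return only the delta; otherwise treat it all as new.
        if message.startswith(self._last_message):
            delta = message[len(self._last_message):]
        else:
            delta = message
        self._last_message = message
        return delta

dedup = CumulativeDeduplicator()
for message in ["1234", "1234 5678", "1234 5678 9012"]:
    print(repr(dedup.new_text(message)))  # '1234', ' 5678', ' 9012'
```

In practice the extracted deltas would still feed a rolling detection buffer (as in the streaming sketch above) so that identifiers spanning delta boundaries are not missed.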
Natural Language:
Prompts to LLMs and responses from LLMs are natural-language text. Because of this, identifiers can be represented in many ways, and pattern matching may not work well in such cases; additional help is needed to recognize the types of identifiers in natural text. For example, a person can be referred to in many ways, so recognizing a person’s name or other entity types in text may require NER (Named Entity Recognition) models. Many DLP systems rely mostly on regular-expression-based pattern matching and can therefore generate many false positives or even miss detections.
Consider the following examples of how named entities, particularly persons, can be represented:
1. Name Variations: “John Smith”, “Mr. J. Smith”, “Dr. Johnathan Smith”
2. Nicknames and Aliases: “Johnny”, “J.S.”
3. Different Contextual References: “John, the project manager”, “Smith from the marketing department”
4. Titles and Honorifics: “President John Smith”, “Professor Smith”
5. Non-Standard or Misformatted Entries: “J0hn Sm1th”, “J_Smith”
Organizations and locations can also be represented in many ways, as shown below.
1. Organizations: “Google Inc.”, “Ggl”, “Alphabet’s subsidiary”
2. Locations: “New York”, “NYC”, “The Big Apple”
Regular expressions are difficult to craft for detecting such variations. NER models like spaCy, Stanza, and Google NLP entity models, along with domain-specific named entity models, offer more accurate detection. Many DLP systems today lack the capability to leverage NER models.
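For example, an off-the-shelf NER model such as spaCy can recognize person, organization, and location entities that fixed regular expressions would miss. The snippet below is a minimal sketch assuming the en_core_web_sm model is installed; exact results depend on the model version.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Dr. Johnathan Smith asked Smith from the marketing department to "
        "share the New York forecast prepared for Google Inc.")

# Print each detected entity with its label.
for entity in nlp(text).ents:
    print(entity.text, entity.label_)
# Typically yields PERSON, ORG, and GPE entities that regex rules alone
# would not recognize.
```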
Nuances of LLM responses
Responses from LLMs, while appearing seamless, encapsulate a plethora of nuances. For instance, these models can generate multiple drafts or iterations of a response, offering users a selection to choose from. This presents challenges for inline DLP solutions, which might struggle with detecting sensitive information accurately. Analyzing all drafts together as one logical text could lead to duplication of detected entities and inflated counts of potential data leaks, while examining only a single draft might miss sensitive data present in other versions.
Additionally, LLMs often incorporate various references, including URLs from external websites, internal document references, email contents, and SharePoint documents. They also suggest prompts to guide users in refining their queries or responses. If these references and suggestions are included in the same text, it can result in false positives during data detection.
Current inline DLP solutions lack the capability to interpret and differentiate between multiple drafts or iterations generated by LLMs. They also struggle to distinguish between generated text and incorporated references or suggestions, leading to ineffective data protection measures.
Other nuances of LLM-generated text include:
Summarization and Paraphrasing:
LLMs can summarize or paraphrase the data, which might change the original wording but retain sensitive information. This can make it challenging for detection tools to identify the paraphrased sensitive information accurately.
For example, if the original text contains the sensitive information “John Doe’s Social Security Number is 123-45-6789,” an LLM might paraphrase it to “The SSN of John Doe is 123456789.” While the format and wording have changed, the sensitive information remains intact.
Another instance could be a document stating, “The acquisition cost was $5 million.” An LLM might summarize this as “The purchase price was five million dollars,” which, although differently worded, conveys the same sensitive information.
Inline DLP solutions today do not address these variations effectively.
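One partial mitigation, sketched below, is to normalize formatting before pattern matching so that differently formatted variants of the same identifier still match. This helps with reformatted identifiers such as the SSN example; semantic rewording (like amounts spelled out in words) still requires NER or similarity-based checks. The normalize helper is an illustrative assumption.

```python
import re

# Match a 9-digit SSN after formatting has been normalized away.
SSN_DIGITS = re.compile(r"\b\d{9}\b")

def normalize(text: str) -> str:
    """Collapse separators (spaces, hyphens, dots) between digits."""
    return re.sub(r"(?<=\d)[\s\-.](?=\d)", "", text)

for variant in ["John Doe's Social Security Number is 123-45-6789",
                "The SSN of John Doe is 123456789"]:
    print(SSN_DIGITS.findall(normalize(variant)))  # ['123456789'] for both
```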
Embedded Data:
LLMs can embed data within text in ways that are not immediately obvious, such as hidden metadata or encoded information, which traditional DLP tools might miss.
An example of encoded information could be a piece of text where sensitive data is converted into a string of characters using a specific encoding scheme. For instance, the phrase “Confidential Project X” might be encoded as “Q29uZmlkZW50aWFsIFByb2plY3QgWA==” using Base64 encoding. While the actual words are transformed into an unintelligible format, they can still be decoded to reveal the original sensitive content.
Inline DLP solutions may not detect sensitive information unless they perform decoding to address these cases.
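The sketch below shows one possible approach: heuristically spot Base64-looking substrings, decode them, and re-run detection on the decoded text. The candidate regex and keyword detector are illustrative assumptions rather than a complete decoding pipeline.

```python
import base64
import re

KEYWORD = re.compile(r"confidential", re.IGNORECASE)
BASE64_CANDIDATE = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")

def scan_with_decoding(text: str) -> list:
    """Run keyword detection on the text and on decodable Base64 substrings."""
    hits = KEYWORD.findall(text)
    for candidate in BASE64_CANDIDATE.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except ValueError:
            continue  # not valid Base64, or not text once decoded
        hits += KEYWORD.findall(decoded)
    return hits

print(scan_with_decoding("Status update: Q29uZmlkZW50aWFsIFByb2plY3QgWA=="))
# ['Confidential'] - found only after decoding
```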
Absence of cascading detections within the text:
Current inline DLP solutions support the detection of PII/PHI and custom enterprise-specific data. However, they fall short when it comes to secondary, tertiary, and other nested detections. For example, consider a scenario where one needs to track different citations referenced by LLM-generated text. Typically, LLMs provide URL references and additional details about each citation, such as the sender, recipients, and subject of an email if the citation pertains to emails. Detecting specific URLs or email subjects within these nested citations is currently beyond the capabilities of many DLP systems.
Nested detections, while not exclusive to LLM transactions, offer significant value in such contexts. Imagine defining a DLP rule that allows for a hierarchy of detections. This would enable organizations to identify specific pieces of information within broader datasets, enhancing their ability to secure sensitive data effectively.
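A minimal sketch of such a hierarchy is shown below: a primary "citation" rule scopes the text, and secondary URL and email-subject detectors run only inside what the primary rule matched. The rule format, patterns, and sample citation syntax are hypothetical.

```python
import re

# Hypothetical cascading rule set: a primary detector scopes the text, and
# secondary detectors run only inside what the primary rule matched.
RULES = {
    "citation": {
        "primary": re.compile(r"\[source:.*?\]", re.DOTALL),
        "children": {
            "url": re.compile(r"https?://[^\s;\]]+"),
            "email_subject": re.compile(r"subject=([^;\]]+)"),
        },
    },
}

def cascade(text: str) -> list:
    """Apply secondary detectors only within primary matches."""
    findings = []
    for rule_name, rule in RULES.items():
        for match in rule["primary"].finditer(text):
            for child_name, child_pattern in rule["children"].items():
                findings += [(rule_name, child_name, value)
                             for value in child_pattern.findall(match.group())]
    return findings

sample = ("Revenue grew 12%. "
          "[source: https://sharepoint.example.com/q3.docx; subject=Q3 results]")
print(cascade(sample))
```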
Furthermore, many DLP systems do not log the information they detect. While it is crucial to avoid logging sensitive data, secondary information such as URLs and email addresses, though sensitive, may need to be logged for further analysis. Of course, enterprises should remain in control of which sensitive information can be logged.
For instance, knowing all users accessing LLMs that cite specific URLs or SharePoint documents could help identify permission issues within those documents. Proper logging is vital for troubleshooting larger security problems, such as overly permissive access settings.
Consider another scenario where an LLM generates a report including a citation of a confidential financial document stored on a public cloud service. The DLP system should not only detect the citation but also log the access to this document, enabling security teams to investigate and resolve any potential exposure of sensitive financial data.
Essential supplementary mechanisms needed in inline DLP to manage data access through LLMs
Inline DLP solutions for LLMs should have the following capabilities to address the challenges discussed.
Inline DLP solutions should be able to interpret not only HTTP/1.1, but also HTTP/2 and HTTP/3. At the HTTP level, inline DLP solutions must identify each session, especially for HTTP/2 and HTTP/3 where transactions can be multiplexed over a single TCP connection. Inline DLP solutions are also required to interpret WebSocket messages, gRPC, and Server-Sent Events to extract content, buffer the right amount of data before performing detection operations, and ensure seamless user experience without causing disruptions.
To address one of the challenges described above, inline DLP solutions that support LLM transactions should implement effective data deduplication processes before performing detection functions. By identifying and removing duplicate data, these solutions can minimize false positives and reduce the number of unnecessary alerts.
Inline DLP solutions need to incorporate Named Entity Recognition (NER) models to accurately detect various entities, even when these entities are presented in different forms within natural language text. This capability is crucial for identifying sensitive information that might be obfuscated or embedded within complex or nested text structures.
Next generation inline DLP solutions are also expected to have sophisticated parsers for various LLM services that can accurately extract and categorize information before detection operations. For instance, these parsers should be capable of identifying different drafts, references, and suggestions, allowing specific detection rules to be applied based on the type of content. By organizing information into distinct categories, these advanced DLP solutions can reduce false positives and enhance detection accuracy, thereby addressing the nuanced challenges of managing data access through LLMs.
Since sensitive data can be represented in many ways, it is important to implement similarity analysis, such as cosine similarity checks, in addition to pattern and NER-based detections on critical text segments to identify sensitive information effectively. These methods enable the identification of semantically similar or slightly altered sensitive data, ensuring comprehensive data protection.
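As a rough sketch, character n-gram TF-IDF with cosine similarity can catch near-duplicates and lightly reworded text; semantically rephrased content would typically need embedding models instead. The threshold below is an illustrative assumption, not a recommended value.

```python
# Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

protected_text = "The acquisition cost was $5 million."
candidate_text = "The purchase price was 5 million dollars."

# Character n-grams tolerate small wording and formatting changes.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
matrix = vectorizer.fit_transform([protected_text, candidate_text])
score = cosine_similarity(matrix[0], matrix[1])[0, 0]
print(f"similarity={score:.2f} flagged={score > 0.5}")
```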
Implementing advanced decoding capabilities within DLP systems is essential for detecting encoded sensitive information. By incorporating algorithms that automatically decode common encoding schemes such as Base64, hexadecimal, and others, DLP tools can identify hidden sensitive data effectively. This will ensure that sensitive information, even when encoded, is detected and protected.
Developing hierarchical detection rules can significantly improve the ability of DLP systems to identify nested sensitive information. By allowing for cascading detections, where primary rules trigger secondary and tertiary checks, organizations can track complex data structures such as citations in LLM-generated texts. These rules can help in identifying sensitive URLs, email subjects, and other nested information that traditional DLP systems might miss.
Enhancing logging mechanisms to include non-sensitive yet crucial information, such as URLs and email addresses, can provide valuable insights for security analysis. Organizations should define clear policies to determine which data can be logged, ensuring compliance with privacy regulations. Proper logging will enable security teams to monitor data access patterns, identify potential security issues, and take proactive measures to mitigate risks.
Conclusion
In conclusion, while current inline Data Loss Prevention (DLP) solutions provide robust mechanisms for detecting and preventing unauthorized transmission of sensitive information, they fall short in addressing the unique challenges posed by GenAI and LLM transactions. Traditional techniques such as content analysis, pattern matching, watermark detection, metadata inspection, and fingerprint analysis are effective to an extent but struggle with the complexities introduced by LLM transactions.
LLM transactions often involve further encapsulations on HTTP request and response data, streaming of incremental data at the token level, and data duplication. These factors make it difficult for existing DLP solutions to accurately detect sensitive information without generating false positives. Additionally, the natural language nature of LLM prompts and responses, along with the nuances of LLM-generated text, further complicate the detection process.
To address these challenges, next-generation inline DLP solutions must incorporate advanced capabilities such as interpreting HTTP/2 and HTTP/3, effective data deduplication processes, Named Entity Recognition (NER) models, sophisticated parsers for various LLM services, similarity analysis, and advanced decoding capabilities. Implementing hierarchical detection rules and enhancing logging mechanisms are also crucial for improving the accuracy and efficiency of DLP systems in managing data access through LLMs.
By evolving to meet these requirements, inline DLP solutions can better protect sensitive information in the dynamic and complex landscape of GenAI and LLM transactions, ensuring comprehensive data security and compliance.