Showing posts with label unicode. Show all posts
Showing posts with label unicode. Show all posts

11 May 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 185: How Input Sanitization Can Be Used to Counter AI Invisible Prompt Injection)

Prompt Engineering Series
Prompt Engineering Series

Prompt: "write a post of 600 words on how input sanitization can be used AI invisible prompt injection"

Introduction

Invisible prompt injection is one of the most subtle and disruptive vulnerabilities in modern AI systems. It exploits the fact that large language models treat nearly all incoming text as potentially meaningful instructions. When hidden commands are embedded inside documents, images, or metadata, the model may follow them without the user ever noticing. This creates a dangerous gap between what the user thinks they are asking and what the AI is actually responding to. Among the available defenses, input sanitization stands out as one of the most practical and foundational. It does not solve the problem entirely, but it dramatically reduces the attack surface by filtering, normalizing, and constraining the content that reaches the model’s interpretive layer.

The first way input sanitization helps is by removing hidden characters and invisible control sequences. Many prompt injection attacks rely on zero‑width characters, Unicode tricks, or formatting markers that humans cannot see but the model interprets as part of the prompt. These characters can smuggle instructions into otherwise harmless text. Sanitization routines that strip or normalize these characters prevent the model from reading them as meaningful input. This is similar to how web applications sanitize user input to prevent hidden SQL commands from being executed. By reducing the 'invisible' portion of the input, sanitization makes it harder for attackers to hide instructions in plain sight.

A second benefit comes from filtering out hidden markup and metadata. Invisible prompt injection often hides inside HTML comments, alt‑text, EXIF metadata, or other fields that users rarely inspect. When an AI system ingests a webpage, document, or image, it may treat these hidden fields as part of the prompt. Sanitization can remove or neutralize these elements before they reach the model. For example, stripping HTML tags, flattening markup, or removing metadata ensures that only the visible, user‑intended content is passed to the AI. This prevents attackers from embedding instructions in places that humans cannot easily detect.

Another important role of sanitization is normalizing the structure of the input. Many prompt injection attacks rely on breaking the expected structure of the prompt - introducing unexpected delimiters, injecting new instruction blocks, or manipulating formatting to confuse the model. Sanitization can enforce a consistent structure by collapsing whitespace, removing unusual delimiters, or reformatting the input into a predictable template. This reduces the model’s exposure to structural manipulation and makes it harder for attackers to smuggle in new instruction boundaries.

Input sanitization also supports context isolation, a broader architectural strategy. By sanitizing external content before it is combined with user instructions, systems can ensure that only the user’s explicit prompt influences the model’s behavior. For example, if a user uploads a document for summarization, sanitization can remove any embedded instructions before the document is passed to the model. This prevents the document from overriding the user’s intent. Sanitization becomes a gatekeeper that separates trusted instructions from untrusted content.

A further advantage is reducing ambiguity, which is often exploited in invisible prompt injection. When input is messy, inconsistent, or contains mixed signals, the model may latch onto the wrong part of the text. Sanitization that clarifies formatting, removes noise, and enforces consistency helps the model focus on the intended content rather than on accidental or malicious artifacts. Cleaner input leads to more predictable behavior.

Finally, input sanitization is valuable because it is scalable and proactive. It does not require detecting every possible attack pattern. Instead, it reduces the overall complexity of the input space, making it harder for attackers to exploit obscure or unexpected pathways. While sanitization cannot eliminate invisible prompt injection entirely, it forms a crucial first line of defense - one that strengthens other safeguards such as retrieval grounding, context isolation, and self‑critique mechanisms.

Invisible prompt injection is a structural challenge, but input sanitization offers a practical, effective way to reduce its impact. By filtering, normalizing, and constraining the content that reaches AI systems, we can build more resilient models that remain aligned with user intent - even when confronted with hidden manipulation.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

08 December 2010

💎🏭SQL Reloaded: Pulling the Strings of SQL Server IX (Special Characters)

Under special characters denomination are categorized typically the characters that don’t belong to the alphabet of a given language, typically English (a-z, A-Z), or the numeric digits (0-9). Characters like umlauts (e.g. ä, ë, ö, ü, ÿ), accents (e.g. é, í, ó, ú) and other type of characters used in specific languages like German, French, Turkish, Hungarian, Romanian, Dutch or Swedish, together with punctuation signs or graphical characters are falling under the designation of special characters. It’s true that SQL Server, like many other databases, supports Unicode, a standard designed to encode such characters, though not all database designs are taking Unicodes into account. 

Preponderantly the old database and other software solutions use the non-unicode string data types (char, varchar, text) fact that makes the special characters to be displayed inadequately, sometimes undecipherable. In order to avoid this behavior could be decided to port the data types to unicode or use only the standard characters of English language, both solutions with their positive and negative aspects. In such cases, especially during migration project or ETL tasks, eventually as part of a Data Quality initiative, it’s preferred to identify and replace the special characters in a manual, automatic or semi-automatic fashion. In addition, there are also cases in which, from various reasons, some attributes are not allowed or should not include special characters, and also this aspect could be included in a Data Quality initiative.

During assessment of Data Quality, in an organized or less organized manner, a first step resides in understanding the quality of the data. In the current case this resumes primarily in identifying how many records contain special characters, how could be the data cleaned, and the “cost” for this activity. Actually, before going this far, must be defined the character sets included in special characters, the definition could vary from case to case. For example in some cases could be considered also the space or the important punctuation signs as valid characters, while in others they may not be allowed. 

There could be identified thus multiple scenarios, though I found out that the range of characters a-z, A-Z, 0-9 and the space are considered as valid character in most of the cases. For this purpose could be built a simple function that iterates through all the characters of a string and identifies if there is any character not belonging to the before mentioned range of valid characters. In order to address this, a few years back I built a function similar with the below one:

-- checks if a string has special characters 
CREATE FUNCTION dbo.HasSpecialCharacters( 
@string nvarchar(1000)) 
RETURNS int 
AS  
    BEGIN 
         DECLARE @retval int 
         DECLARE @index int 
         DECLARE @char nchar(1)  

         SET @retval = 0 
         SET @index = 1  

         WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) 
        BEGIN  
           SET @char = Substring(@string, @index, @index+1) 
           IF NOT (ASCII(@char) BETWEEN 48 AND 57 -- numeric value 
             OR ASCII(@char) BETWEEN 65 AND 90 -- capital letters 
            OR ASCII(@char) BETWEEN 97 AND 122 -- small letters 
            OR ASCII(@char) = 32) --space 
           BEGIN 
                SET @retval = @index 
           END 
           ELSE 
               SET @index = @index + 1  
     END  

    RETURN (@retval) 
END 

Function’s logic is based on the observation that the ASCII of numeric values could be found in the integer interval between 48 and 57, the capital letters between 65 and 90, while the small letters between 97 and 122. By adding the ASCII for space and eventually several other characters, the check on whether an character is valid resuming thus to only 4 constraints. Here’s the function at work:  

-- testing HasSpecialCharacters function 
SELECT dbo.HasSpecialCharacters('kj324h5kjkj3245k2j3hkj342jj4') Example1 
, dbo.HasSpecialCharacters('Qualität') Example2 
, dbo.HasSpecialCharacters('Änderung') Example3 
, dbo.HasSpecialCharacters('') Example4 
, dbo.HasSpecialCharacters(NULL) Example5 
, dbo.HasSpecialCharacters('Ä') Example6 
, dbo.HasSpecialCharacters('ä') Example7 
, dbo.HasSpecialCharacters('a') Example8 
 
special characters 1

As can be seen, the function returns the position where a special character is found, fact that enables users to identify the character that causes the problem. A similar function could be built also in order to count the number of special characters found in a string, the change residing in performing a counter rather then returning the position at the first occurrence of a special character.

The function might not be perfect though it solves the problem. There are also other alternatives, for example of storing the special characters in a table and performing a simple join against the target table. Another solution could be based on the use RegEx functionality, either by using OLE automation or directly CLR functionality. There could be done variations on the above solution too by limiting to check on whether the characters of a string are falling in the range projected by the ASCII function. That’s what the following function does:  

-- checks if a string has special characters falling in an interval  
CREATE FUNCTION dbo.HasCharNotInASCIIRange( 
@string nvarchar(1000) 
, @start int 
, @end int) 
RETURNS int 
AS  
BEGIN      
    DECLARE @retval int 
    DECLARE @index int 
    DECLARE @char nchar(1) 
 
    SET @retval = 0 
    SET @index = 1 
    WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) 
    BEGIN  
         SET @char = Substring(@string, @index, @index+1) 
         IF NOT (ASCII(@char) BETWEEN @start AND @end)  
        BEGIN 
              SET @retval = @index 
        END 
        ELSE 
             SET @index = @index + 1  
     END 
     RETURN (@retval) 
END 

With this function are necessary 4 calls in order to identify if a string contains special characters, though we loose the flexibility of identifying the first character that is invalid. We could still identify the first occurrence by taking the minimum value returned by the 4 calls, however, unlike Oracle (see Least function), SQL Server doesn’t have such a function, so we’ll have eventually to built it. Anyway, here’s the above function at work:  

-- testing HasCharNotInASCIIRange function 
SELECT dbo.HasCharNotInASCIIRange('k12345', 48, 57) Example1 
, dbo.HasCharNotInASCIIRange('12k345', 48, 57) Example2 
, dbo.HasCharNotInASCIIRange('12345', 48, 57) Example3 
, dbo.HasCharNotInASCIIRange(' 12345', 48, 57) Example4 
, dbo.HasCharNotInASCIIRange('12345 ', 48, 57) Example5 
, dbo.HasCharNotInASCIIRange('', 48, 57) Example6 
, dbo.HasCharNotInASCIIRange(NULL, 48, 57) Example7 
, dbo.HasCharNotInASCIIRange('', 48, 57) Example8 
, dbo.HasCharNotInASCIIRange('a', 48, 57) Example9 
, dbo.HasCharNotInASCIIRange('Ä', 48, 57) Example10 
, dbo.HasCharNotInASCIIRange('ä', 32, 32) Example11 

special characters 2


Notes:
The queries work also in SQL databases in Microsoft Fabric

Happy coding!

07 December 2010

💎SQL Reloaded: Pulling the Strings of SQL Server I 7- Introduction

    The (character) string or simply the character data type, how is named by the MSDN documentation, is one of the primary data types available in a database and probably the most complex given the fact that it can encompass any combination of numeric, literal, date or any other primary data type. In fact it could include any type of chunk of text that could be written in any language as SQL Server supports Unicode and thus most of the (written) languages. In this post I’m not intending to make a complete anthology of strings, and neither to retake the whole volume of information available in MSDN or other important books. My intent is to look at strings from a slightly different perspective, considering the various functions involving strings and how they could be used in order to provide various functionality, in fact the cornerstone of everyday developer.

   A few things to remember:

1. there were mainly two types of non-unicode strings: the char (or character) of fixed length, respectively the varchar of variable length (varying character)

2. if initially both types of strings were having a maximum length of 8000 of characters, with SQL Server it’s possible to have a varchar with maximum storage size, declared as varchar(max).

3. if in the past there were some strict recommendations in what concerns the use of char or varchar, nowadays the varchar tends to be used almost everywhere, even for the strings of length 1.

4. talking about length, it denotes the number of chracters a string stores.

5. the trailing spaces, the spaces found at the right extremity of a string are typically ignored, while the leading spaces, the spaces found at the left extremity, are not ignored.

6. starting with SQL Server 2000, for the two character data types were introduced the corresponding unicode data types prefixed with n (national): nchar, respectively nvarchar.

7. given the fact that a unicode character needs more space to store the same non-unicode string, actually the number of bits doubles, the maximum length for an unicode string is only 4000.

8. there is also a non-unicode text, respectively ntext unicode data type, designed to store maximum length, though as it seems they could become soon deprecated, so might be a good idea to avoid it.

9. not-initialized variables, including strings, have the value NULL, referred also the NULL string, therefore it’s always a good idea to initialize your variables.

10. by empty string string is designated the string containing no character “’’”, and has the length 0.

11. there are several specific functions available for the creation, manipulation and conversion of strings from and to other data types.

12. not all of the aggregated functions work with string values (specifically the ones requesting a number value like SUM, AVG, STDV).

13. the operations performed on strings of different data types are generally not impacted by this aspect, though there are some exceptions.

14. there are several (basic) operations with strings, typically concatenation, extraction of subpart of a string, insertion, replacement or deletion of characters, rearrangement of string’s characters, trimming, splitting it in substrings (decomposition), etc.

15. there are several numeric values based on strings: length, position of a given text in a text, number of not empty characters, encoding of a character, on whether the text represents a valid numeric or date values, etc.

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.