Suppose I have an NFD text, how do I recompose it back?

Suppose I was given a text in NFD (Normalization Form D, Canonical Decomposition). How would I recompose it? In other words, if I had "แก ้ว", I want it recomposed to "แก้ว". The following Java code doesn’t do it.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.text.Normalizer;

public class RecomposeNFD {
    public static void main(String[] args) throws Exception {
        // FileReader and PrintWriter use the default charset here,
        // which is why the program is run with -Dfile.encoding=UTF-8.
        // try-with-resources closes both files even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
             PrintWriter writer = new PrintWriter("output.txt")) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Decompose the line, then recompose the decomposed form.
                String nfd       = Normalizer.normalize(line, Normalizer.Form.NFD);
                String recompose = Normalizer.normalize(nfd,  Normalizer.Form.NFC);
                // Write original, NFD and NFC forms separated by underscores.
                writer.println(line + "_" + nfd + "_" + recompose);
            }
        }
    }
}

Running it on an input.txt (UTF-8) containing

แก ้ว


using the following command

java -Dfile.encoding=UTF-8 RecomposeNFD

gives the following output:

| Line | Actual Output | Expected Output | Flag |
|---|---|---|---|
| 1 | あ_あ_あ | あ_あ_あ | As Expected |
| 2 | แก้ว_แก้ว_แก้ว | แก้ว_แก ้ว_แก้ว | Not As Expected (2nd element) |
| 3 | แก ้ว_แก ้ว_แก ้ว | แก ้ว_แก ้ว_แก้ว | Not As Expected (3rd element) |

While writing this test code, I also found that Normalizer.normalize(line, Normalizer.Form.NFD) does not decompose as expected for line 2 (the 2nd element of its output).


An editor might treat line 2 of input.txt (below) as a single character, moving across it with the arrow keys or deleting it with [Delete] in one keystroke, but in binary it is actually two characters. For this example there is no canonical composed form, so NFC and NFD both leave the text unchanged. Line 3 only looks like a canonical decomposition because of the space it contains (as pointed out by Scratte). Furthermore, in some editors the cursor moves as if there were no space in line 3.

line 1: あ (U+3042)
line 2: ก้ (U+0E01 U+0E49)
line 3: ก ้ (U+0E01 U+0020 U+0E49)
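
The behaviour described above can be checked with a minimal sketch. The class name StraySpaceDemo and the helper dropSpaceBeforeMark are hypothetical names introduced only for illustration. The helper assumes that a U+0020 directly before a combining mark is never intentional, which may not hold for every text.

```java
import java.text.Normalizer;

public class StraySpaceDemo {
    // Hypothetical helper: removes a U+0020 that sits directly before a
    // combining mark (Unicode general category M), e.g. before U+0E49.
    static String dropSpaceBeforeMark(String s) {
        return s.replaceAll(" (?=\\p{M})", "");
    }

    public static void main(String[] args) {
        String clean  = "\u0E41\u0E01\u0E49\u0E27";   // line 2: no space
        String spaced = "\u0E41\u0E01 \u0E49\u0E27";  // line 3: U+0020 before U+0E49

        // There is no precomposed form, so NFD and NFC return the text unchanged.
        System.out.println(Normalizer.normalize(clean, Normalizer.Form.NFD).equals(clean));   // true
        System.out.println(Normalizer.normalize(clean, Normalizer.Form.NFC).equals(clean));   // true

        // Normalization does not remove the space either.
        System.out.println(Normalizer.normalize(spaced, Normalizer.Form.NFC).equals(spaced)); // true

        // Dropping the stray space gives the text of line 2.
        System.out.println(dropSpaceBeforeMark(spaced).equals(clean));                        // true
    }
}
```

So if the goal is to turn line 3 into line 2, the space has to be removed explicitly. Normalization alone will not do it.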


With an example that has both a canonical composition and a canonical decomposition, the code works as expected.

| Input Editor | Input Unicode | Output Editor | Output Unicode |
|---|---|---|---|
| Å | U+00C5 | Å_Å_Å | U+00C5 U+005F U+0041 U+030A U+005F U+00C5 |
| Å | U+0041 U+030A | Å_Å_Å | U+0041 U+030A U+005F U+0041 U+030A U+005F U+00C5 |
| Å | U+212B | Å_Å_Å | U+212B U+005F U+0041 U+030A U+005F U+00C5 |
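
The same result can be reproduced without files. This is a small sketch rather than the original test program. The class name AngstromDemo and the helpers nfd and nfc are hypothetical wrappers around Normalizer.normalize. The inputs are written as Unicode escapes so that an editor cannot silently change them.

```java
import java.text.Normalizer;

public class AngstromDemo {
    // Hypothetical convenience wrappers around java.text.Normalizer.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "\u00C5",   // LATIN CAPITAL LETTER A WITH RING ABOVE (precomposed)
            "A\u030A",  // A followed by COMBINING RING ABOVE (decomposed)
            "\u212B"    // ANGSTROM SIGN
        };
        for (String input : inputs) {
            String decomposed = nfd(input);
            String recomposed = nfc(decomposed);
            // Every input decomposes to U+0041 U+030A and recomposes to U+00C5.
            System.out.println(decomposed.equals("A\u030A") + " "
                    + recomposed.equals("\u00C5")); // true true
        }
    }
}
```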

So basically, the code is fine. A word of advice: check the input with a binary editor, because what an editor displays can hide a stray character.
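
If no binary editor is at hand, the code points can also be printed from Java. The following is a minimal sketch. The class name CodePointDump and the helper toCodePoints are hypothetical names, not part of any library.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: formats every code point of s as U+XXXX,
    // separated by single spaces.
    static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toCodePoints("\u3042"));         // U+3042
        System.out.println(toCodePoints("\u0E01\u0E49"));   // U+0E01 U+0E49
        System.out.println(toCodePoints("\u0E01 \u0E49"));  // U+0E01 U+0020 U+0E49
    }
}
```

Calling toCodePoints on each line read from input.txt would show the U+0020 in line 3 immediately.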

Thanks Scratte.
