@adjam
Created July 5, 2019 18:03
icu4j charset detector
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class Detector {
    public static void main(String[] args) {
        // Each command-line argument is treated as the path of a file to inspect.
        for (String fileName : args) {
            // CharsetDetector.setText(InputStream) needs a stream that supports mark/reset,
            // which BufferedInputStream provides.
            try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(fileName))) {
                CharsetDetector dt = new CharsetDetector();
                dt.setText(bis);
                // detectAll() returns every plausible charset, best match first.
                for (CharsetMatch csm : dt.detectAll()) {
                    System.out.printf("%s detected encoding: %s with confidence: %d/100%n",
                            fileName, csm.getName(), csm.getConfidence());
                }
            } catch (IOException ioe) {
                // Report the failure and continue with the next file.
                System.err.println("Error encountered detecting encoding for " + fileName);
                ioe.printStackTrace(System.err);
            }
        }
    }
}

adjam commented Jul 5, 2019

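To compile and run, you need the icu4j jar on the classpath. Assuming a suitable version is in the same directory as Detector.java and is named icu4j.jar (substitute the actual jar file name otherwise),

$ javac -cp icu4j.jar Detector.java
$ java -cp .:icu4j.jar Detector file [file2 ...]

will print the detected encodings for each file to standard output, one line per candidate. On Windows, use ; instead of : as the classpath separator.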
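
The confidence values are heuristic, so a reported name is a candidate rather than a guarantee. One way to sanity-check a candidate is to decode the file strictly with that charset and see whether any byte sequence is rejected. The sketch below is not part of the gist: the class name StrictDecodeCheck and the method decodesCleanly are made up for illustration, and it uses only the JDK, so it assumes the JVM supports the charset name that ICU reported (some ICU names may not be supported). A clean decode shows only that the bytes are valid in that charset, not that the charset is the right one.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original gist.
public class StrictDecodeCheck {

    // Returns true if the whole byte array is valid in the named charset.
    // Charset.forName throws an unchecked exception if the JVM does not
    // know the name, which can happen for some names reported by ICU.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName)
                .newDecoder()
                // REPORT makes the decoder throw instead of substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StrictDecodeCheck <file> <charset>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.printf("%s decodes cleanly as %s: %b%n",
                args[0], args[1], decodesCleanly(data, args[1]));
    }
}
```

It can be run on a file together with one of the names printed by Detector, for example java StrictDecodeCheck somefile.txt UTF-8, where somefile.txt is a placeholder. Single-byte charsets such as ISO-8859-1 accept every byte, so this check is mainly useful for ruling out multi-byte candidates like UTF-8.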
