This document outlines some ideas for document conversion on Linux and Mac OS X platforms using command line tools. Distribute documents as plain text using UTF-8 encoding whenever possible. Everyone should embrace the mantra "plain text is beautiful".
Use the file command to obtain basic metadata for most file formats. For image files, make sure you have ImageMagick installed, then use the identify command to extract image metadata.
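As a minimal sketch, the two commands might be combined in a small script like the one below. The function name describe_file is hypothetical. The sketch assumes a file implementation that accepts the --mime-type and --mime-encoding options, and it calls identify only when ImageMagick is present.

```shell
#!/bin/sh
# Hypothetical helper: print the MIME type and character encoding of a file,
# plus format and pixel dimensions when it is an image and identify is available.
describe_file() {
    # -b omits the file name from the output of file
    mime=$(file -b --mime-type "$1")
    enc=$(file -b --mime-encoding "$1")
    echo "$1: $mime ($enc)"
    case $mime in
        image/*)
            if command -v identify >/dev/null 2>&1; then
                # %m is the image format, %w and %h are width and height in pixels
                identify -format '%m %wx%h\n' "$1"
            fi
            ;;
    esac
}
```

Call it as describe_file photo.png, where photo.png stands for any file on your system.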
Use the iconv command to convert plain text from one encoding to another. The basic usage is:
$ iconv -c -f <source_encoding> -t <target_encoding> input.txt > output.txt
The -c option discards unconvertible characters, and angle brackets denote required arguments. For a list of supported encodings, run:
$ iconv -l
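To make the usage concrete, here is a small sketch that converts a Latin-1 file to UTF-8. The function name to_utf8 and the file names are hypothetical, and the sketch assumes that ISO-8859-1 and UTF-8 both appear in the output of iconv -l on your system.

```shell
#!/bin/sh
# Hypothetical helper: convert a file from a given source encoding to UTF-8.
# Usage: to_utf8 <source_encoding> <input_file> <output_file>
to_utf8() {
    # -c drops characters that cannot be represented in the target encoding
    iconv -c -f "$1" -t UTF-8 "$2" > "$3"
}

# Create a small Latin-1 sample: "caf" followed by byte 0xE9 (e-acute in ISO-8859-1).
printf 'caf\351\n' > latin1.txt

# Convert it; in utf8.txt the e-acute becomes the two bytes 0xC3 0xA9.
to_utf8 ISO-8859-1 latin1.txt utf8.txt
```

Writing to a separate output file, rather than redirecting onto the input, avoids truncating the source before iconv has read it.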
The Poppler library (https://poppler.freedesktop.org/), based on Xpdf, comes with a suite of PDF tools. Use the pdftotext command to extract text from a PDF file, assuming a text layer exists.
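For a folder of PDFs, extraction could be scripted roughly as follows. This is only a sketch: the function name pdf_dir_to_text is hypothetical, and the -layout option is one choice among several that pdftotext offers.

```shell
#!/bin/sh
# Hypothetical helper: extract text from every PDF in a directory.
# Each foo.pdf yields foo.txt in the same directory.
pdf_dir_to_text() {
    for pdf in "$1"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs
        [ -e "$pdf" ] || continue
        # -layout tries to preserve the physical layout of the page
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}
```

A scanned PDF without a text layer will produce an empty or nearly empty .txt file, so checking the output size is a quick way to spot documents that need OCR instead.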
Use the html2text command (http://www.mbayer.de/html2text/) to extract text from an HTML file.
DjVuLibre (http://djvu.sourceforge.net/), an open source DjVu library and viewer, comes with a suite of command-line utilities. Use the djvutxt command to extract text from a DjVu file, assuming a text layer exists.
Use UnRTF (https://www.gnu.org/software/unrtf/) to extract text from RTF files.
Install the xml-twig-tools package. Use xml_grep to extract text from an XML document:
$ xml_grep example.xml --text_only
To extract text only from the mytag tag:
$ xml_grep 'mytag' example.xml --text_only
Use the textutil command to convert plain text to rtf, rtfd, html, doc, docx, odt, wordml, and webarchive formats. The -info option extracts basic metadata from files of these formats. textutil is based on the Cocoa framework, so it isn't available on Linux.
Use the cupsfilter command to convert non-PDF formats to PDF.
Use the enscript command (http://www.linuxfromscratch.org/blfs/view/svn/pst/enscript.html) to convert text files to PostScript, HTML, and RTF. Unfortunately, enscript does not support UTF-8 encoding.
Use the paps command (http://paps.sourceforge.net/) to format UTF-8 plain text files as PostScript. paps requires the Pango library (http://www.pango.org/).
Use the pandoc command (http://pandoc.org/) to convert among popular markup formats. Note that pandoc supports the newer XML-based docx MS Word format but not the older OLE-based doc MS Word format.
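A typical pandoc invocation might look like the sketch below, which turns a Markdown file into standalone HTML and into docx. The function name md_to_html_and_docx is hypothetical, and Markdown is chosen only as an example input format.

```shell
#!/bin/sh
# Hypothetical helper: convert a Markdown file to standalone HTML and to docx.
# Usage: md_to_html_and_docx notes.md   (writes notes.html and notes.docx)
md_to_html_and_docx() {
    base=${1%.*}
    # -s (--standalone) emits a complete document rather than a fragment
    pandoc -s -f markdown -t html "$1" -o "$base.html" || return 1
    # Without -t, pandoc infers the output format from the .docx extension
    pandoc -f markdown "$1" -o "$base.docx"
}
```

Running pandoc --list-input-formats and pandoc --list-output-formats shows which formats your installed version handles.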
Use the textutil command to convert among txt, rtf, rtfd, html, doc, docx, odt, wordml, and webarchive formats.
Use the cupsfilter command to convert TXT to PDF and HTML to PDF.
If you have LibreOffice installed on your system, you can run the soffice command in headless mode to convert documents:
$ soffice --headless --convert-to <TargetFileExtension>[:<NameOfFilter>] input_file.xxx
Note that the square brackets around :<NameOfFilter> mean that this part is optional. The output file will be named input_file.TargetFileExtension. On the Windows command line, the convert-to parameter uses only one dash.
Please refer to LibreOffice documentation for details: https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
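The naming rule for the output file can be sketched in a small wrapper. Both function names below are hypothetical, and the --outdir option, which the text above does not mention, is assumed to be available in your LibreOffice version.

```shell
#!/bin/sh
# Hypothetical helper: predict the output file name for an input file
# and a target extension, following the input_file.TargetFileExtension rule.
converted_name() {
    name=$(basename "$1")
    echo "${name%.*}.$2"
}

# Hypothetical helper: convert one document to PDF in a chosen directory
# and print the path of the resulting file.
# Usage: to_pdf <input_file> <output_directory>
to_pdf() {
    # --outdir selects the directory where the converted file is written
    soffice --headless --convert-to pdf --outdir "$2" "$1" &&
        echo "$2/$(converted_name "$1" pdf)"
}
```

For example, converted_name docs/report.docx pdf prints report.pdf, which is the name to look for in the output directory after the conversion.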
Use the pstopdf command to convert PostScript to PDF.
Use the djvutoxml command from the DjVuLibre library (http://djvu.sourceforge.net/) to convert DjVu to XML.
Use UnRTF to convert RTF files to HTML files. UnRTF also supports LaTeX and ASCII plain text output.
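The three output modes could be wrapped as in the following sketch. The function name rtf_convert is hypothetical. The sketch relies on the --html, --text, and --latex options of unrtf, which writes its result to standard output.

```shell
#!/bin/sh
# Hypothetical helper: convert an RTF file with UnRTF.
# Usage: rtf_convert <html|text|latex> <input.rtf> <output_file>
rtf_convert() {
    case $1 in
        html|text|latex)
            # unrtf prints the converted document on standard output
            unrtf "--$1" "$2" > "$3"
            ;;
        *)
            echo "unsupported format: $1" >&2
            return 2
            ;;
    esac
}
```

For instance, rtf_convert html letter.rtf letter.html would write an HTML version of a file named letter.rtf.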