Html2txt: Difference between revisions

From unkrig.de
A tool to convert HTML documents into plain text.

Html2txt is written in Java; it is available as a command line tool and as an APACHE ANT task.

Some HTML elements are converted into "markup" characters, e.g.

<pre>This is a <var>variable</var></pre>

converts into

<pre>This is a &lt;variable&gt;</pre>

For example, this HTML code

[[File:Main.main.jpg]]

is rendered like this:

[[File:usage.txt.jpg]]

For a complete description of the supported HTML inline elements, see
<span class="plainlinks">[http://html2txt.unkrig.de/javadoc/de/unkrig/html2txt/Html2Txt.html#ALL_INLINE_ELEMENTS here]</span>.
 
For a complete description of the supported HTML block elements, see
<span class="plainlinks">[http://html2txt.unkrig.de/javadoc/de/unkrig/html2txt/Html2Txt.html#ALL_BLOCK_ELEMENTS here]</span>.
 
== Motivation ==
 
The goal was to generate the "usage" page that a command line tool usually prints when you invoke it with a "<tt>-help</tt>" or "<tt>--help</tt>" option, rather than maintain it manually (e.g. in the form of "<tt>println()</tt>" statements in the code).
 
The chosen solution is to put a big DOC comment before the "<tt>main()</tt>" method, generate an HTML page with JAVADOC, convert that into a plain text file, put it into the application's JAR file and copy its contents to STDOUT when the user wants to see it.
 
The command line version of <tt>html2txt</tt> itself uses that technique, and you can see the results above.
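
The last step of that technique (copying the bundled text file to STDOUT) can be sketched as follows. This is only an illustration, not the actual <tt>html2txt</tt> code: the class name <tt>UsageDemo</tt> and the resource name <tt>usage.txt</tt> are assumptions, and the sketch expects the text file to sit next to the class inside the JAR file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class UsageDemo {

    /** Hypothetical resource name; a real tool may name and place the file differently. */
    public static final String USAGE_RESOURCE = "usage.txt";

    /** Copies all bytes from "in" to "out" without closing either stream. */
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }

    /** Returns true if the first argument asks for the usage page. */
    public static boolean wantsHelp(String[] args) {
        return args.length > 0 && ("-help".equals(args[0]) || "--help".equals(args[0]));
    }

    public static void main(String[] args) throws IOException {
        if (wantsHelp(args)) {
            // Look up the generated text file relative to this class on the classpath.
            try (InputStream in = UsageDemo.class.getResourceAsStream(USAGE_RESOURCE)) {
                if (in == null) {
                    System.err.println("Usage text not found: " + USAGE_RESOURCE);
                    return;
                }
                copy(in, System.out);
            }
            return;
        }
        // ... normal processing of the command line would go here ...
    }
}
```

Because the text is read from the classpath at run time, the usage page always matches the DOC comment it was generated from, provided the build regenerates the file.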
 
== Download ==
 
You can download the latest version of the runnable JAR file [https://repository.sonatype.org/service/local/artifact/maven/redirect?r=central-proxy&g=de.unkrig&a=html2txt&v=LATEST&c=jar-with-dependencies here].
 
== Limitations ==
 
Since the tool uses the JRE's built-in XML parser, it supports "numeric character references" (like "&amp;#252;" for "ü"), but not "named HTML character entity references" (like "&amp;Uuml;" for "Ü").


For the same reason, the HTML markup in the DOC comments must be "well-formed", i.e. every start tag must be matched by an end tag (like "<code>&lt;li>...&lt;/li></code>"), and void tags must end with a slash, like "<code>&lt;br /></code>".

HTML elements that cannot reasonably be converted into text are simply ignored.
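
The parser-related limitations can be reproduced with the JRE's XML parser alone. The following is a minimal sketch, not code from <tt>html2txt</tt>: the class name <tt>EntityDemo</tt> and the method <tt>parse()</tt> are made up for illustration, and the tool's actual parser setup may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityDemo {

    /**
     * Parses the fragment with the JRE's XML parser.
     * Returns the text content of the root element, or null if the parser rejects the input.
     */
    public static String parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and keeps the parser from printing them to STDERR.
        builder.setErrorHandler(new DefaultHandler());
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // not well-formed, or an undeclared entity was referenced
        }
    }

    public static void main(String[] args) throws Exception {
        // Numeric character reference: accepted.
        System.out.println(parse("<p>f&#252;r</p>"));
        // Named HTML entity: rejected, because XML does not declare "Uuml".
        System.out.println(parse("<p>&Uuml;ber</p>"));
        // Void tag without a slash: rejected as not well-formed.
        System.out.println(parse("<p>one<br>two</p>"));
        // Void tag with a slash: accepted.
        System.out.println(parse("<p>one<br />two</p>"));
    }
}
```

In DOC comments it is therefore safest to write non-ASCII characters either literally or as numeric character references, and to close every tag.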


== Usage ==
 
=== Command line tool ===
 
See [http://html2txt.unkrig.de/Main.main(String%5b%5d).html here].
 
=== ANT task ===
 
See [http://html2txt.unkrig.de/antdoc/index.html here].
 
=== Library ===
 
See [http://html2txt.unkrig.de/javadoc/index.html the JAVADOC].
 
=== Source Code ===
 
See [https://github.com/aunkrig/html2txt the source code repository].
 
== Change Log ==
 
; Version 1.0.2, 2016-11-25:
:* Modified the text of the copyright notice slightly: Replaced "author" with "copyright holders and contributors".
 
; Version 1.0.1, 2016-11-07:
:* Resurrected Java 6 compatibility.
 
== License ==
 
<code>html2txt</code> is published under the "[[New BSD License]]".
 
== Contact ==
 
If you have issues, don't hesitate to [https://sourceforge.net/p/html2txt/tickets/ submit a ticket].
 
To discuss in public, check the [https://sourceforge.net/p/html2txt/discussion/ forum] and/or subscribe to it (envelope icon).

Latest revision as of 12:58, 19 January 2022
