2011-01-03 : In 2002, I wrote a little JavaScript program called HTMLify that would convert
plain text into nicely formatted HTML. Since then, there have been a lot of improvements I've
wanted to make, but a plain JavaScript program is not really up to the task. So I'm going to
promote it to full Java and see what I can do.
Ideas
Input
- (done) Specify input character set
- Handle RTF input
- (done) Handle HTML input
- (done) Read configuration file so that manual adjustments can be kept
- (done) input format and character set
- (done) specific paragraph breaks and other fixes
- (done) regions to exclude (e.g., 3rd party headers)
- (done) Specify algorithm parameters for specific blocks (e.g., paras every line
in this region)
(done) Idea for specifying blocks: Allow regexp for specifying markers. Marker tokens are inserted
into token stream at beginning of match. Each regexp gets a group name ("A"), and then each
instance gets an instance number ("1"). This gives each marker a unique name ("A1", "A5", "B2").
For example, a default set of markers might be placed before every newline.
Ranges can then be specified from marker to marker. Mixing marker groups is allowed, and ranges
are in standard half-open format.
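The marker scheme above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program's actual code: the names MarkerSketch, findMarkers, and slice are hypothetical, markers are modelled as character offsets rather than tokens in a stream, and instance numbers are assumed to start at 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regexp-driven markers and half-open ranges. */
final class MarkerSketch {

    /**
     * Finds every match of regexp in text and records a marker at the
     * beginning of each match. Markers are named group + instance number,
     * e.g. "A1", "A2" (numbering from 1 is an assumption).
     */
    static Map<String, Integer> findMarkers(String text, String group, String regexp) {
        Map<String, Integer> markers = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regexp).matcher(text);
        int instance = 1;
        while (m.find()) {
            markers.put(group + instance, m.start());
            instance++;
        }
        return markers;
    }

    /** Returns the half-open range [from, until): from is included, until is not. */
    static String slice(String text, Map<String, Integer> markers, String from, String until) {
        return text.substring(markers.get(from), markers.get(until));
    }

    public static void main(String[] args) {
        String text = "Preface\nChapter 1\nBody one\nChapter 2\nBody two\n";
        Map<String, Integer> markers = new LinkedHashMap<>();
        markers.putAll(findMarkers(text, "A", "Chapter \\d+"));
        markers.putAll(findMarkers(text, "B", "Body"));
        // Mixing groups: from the first chapter heading up to,
        // but not including, the second body line.
        System.out.print(slice(text, markers, "A1", "B2"));
    }
}
```

Because the range is half-open, the text at the "until" marker belongs to the next range, so adjacent ranges such as A1..B2 and B2..end never overlap.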
Output
- (done) HTML: set the width and metadata tags correctly for iPhone so it doesn't have extra
space on the side.
- (done) HTML: write the headers at the beginning of the file rather than at the end
- (done) EPUB: Write to EPUB files for use with book readers. (tutorial)
- XML: Write to XML file so that HTML and EPUB files can be generated on the fly from that.
Get rid of HTML specific formatting (e.g., manual paragraph indent). Add content level tags
(e.g., title, chapter, author).
- Multi-file output: Break up single large file into smaller files for faster loading
- Generate on the fly from XML?
- Autoload next segment?
- Still want single file version available for offline viewing
2011-02-08 : Well, at this point, I have a Java program that can process plain text files and
produce the same results as the original JavaScript. Now it's time to start doing some
improvements!
2011-06-03 : I've put in a bunch of work recently.
It now produces very nice looking HTML files.
The configuration scripts are a bit peculiar, but I'm quite happy with the results. It fixes
ellipses and quotes. It supports epigraphs and hierarchical sections (chapters, etc.). You can
do arbitrary search and replace to fix typos and things. You can control paragraph detection and
use different settings in different regions.
It would probably be nice to create a manual. :)
Next stop, EPUB!
2011-06-09 : It now produces nice looking EPUB files. However, it takes quite a bit of
fiddling with the script to get all the headers and titles to look good.
Still, I like the results. Now I'm spending more time building config scripts than tweaking the
code. :)
One challenge is producing markup that looks good in all viewers. My original plan was to
keep the HTML as simple as possible. However, getting spacing and line breaks right is starting
to add lots of fiddly markup. Sigh. I'll do my best to let it go.
2011-10-31 : Preliminary HTML source support added. It works well for the simple files
I'm converting initially. It'll probably need more work in the long run.
The config file language is particularly bizarre. It's about time to refactor it into
something better. Unfortunately, I don't really have any ideas for what a better language
would look like.
Also, a little perf work would be good. I suspect it's the regexp engine but I need to
profile to find out.
2011-11-10 : Lots of good fixes. There's MOBI (Kindle) support. I added some syntactic
sugar and clarified how the boundaries of Ranges work, so they're now much easier to use.
I've had some problems with regexps moving markers around. I need a way to indicate
that something needs to be matched but is not going to be replaced.
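One possible way to match context without replacing it is a zero-width lookaround, which java.util.regex supports. The sketch below only illustrates that regexp feature; it is not a description of how the program handles this, and the class name, pattern, and sample text are made up for the example.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: lookarounds match context without consuming it. */
final class LookaroundSketch {

    // (?<=\n) requires a preceding newline and (?=[A-Z]) a following capital
    // letter, but neither is part of the match, so only the spaces between
    // them are replaced.
    private static final Pattern INDENT = Pattern.compile("(?<=\\n) +(?=[A-Z])");

    static String stripIndent(String text) {
        return INDENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The indent before "Two." is removed; the one before "three." is not,
        // because the lookahead for a capital letter fails there.
        System.out.println(stripIndent("One.\n   Two.\n   three."));
    }
}
```

Since the newline and the capital letter stay outside the matched span, anything sitting at those positions (such as a marker placed before a newline) would not be part of the replaced text.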
I'm also worried about stack depth. I'm wondering if I can switch from a "recursive"
approach to an "iterative" approach. It would be nice if that allowed me to run stages
of the pipeline concurrently too.
2011-11-29 : Cover art! Much awesomeness. Automatically lays out a cover with title,
authors, and optional artwork, automatically adjusting font size to fit.
I based the background on blue-curves.jpg from icantgetpublished.com.
Doing a Google image search for "blue background" is rather fun!
Config File Reference
System Filters - These filters will generally go in a file-type config file, not a
project-specific file.
- AddMissingBeginTokens - Adds section, paragraph, and sentence tokens that are missing.
- BlankLineToParagraph - Converts completely blank lines to paragraph tokens.
- CanonicalizeEllipses - Canonicalizes the white space before, after, and within all sequences
of three or more periods to have a single space before and after each period. Matches
this style, which I prefer.
- CanonicalizeWhiteSpace - Converts all runs of white space characters to single spaces.
- ConvertHtml - Converts HTML tag tokens to native formatting tokens.
- IdentifyHtml - Identifies HTML tags in a character stream.
- IdentifySentences - Inserts sentence tokens between sentences within paragraphs.
- LeadingWhiteSpaceLineToParagraph - Prepends a paragraph token to each line that starts with
white space (i.e., is indented).
- LineToParagraph - Prepends a paragraph token to each and every line (i.e., each line is a
paragraph).
- LineToSection - Converts lines that are mostly "graphics" to section tokens.
- MarkLines - Looks for CR-LF runs and inserts the appropriate number of line markers.
- NoOp - Does nothing. Useful for establishing a named position in the processing pipeline.
- RemoveEmptyParagraphs - Removes empty paragraphs from the stream.
- RemoveEmptySections - Removes empty sections from the stream.
- SmartQuotes - Converts regular quotes to smart quotes.
- WriteEPub FileName="" - Writes an EPUB file.
- WriteMobi FileName="" - Writes a MOBI (Kindle) file.
- WriteOfflineHtml FileName="" - Writes a single HTML file suitable for offline viewing.
- WriteSystemOutFilter - Dumps all tokens to System.out. Useful for debugging.
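As an illustration of what CanonicalizeEllipses and CanonicalizeWhiteSpace are described as doing, here is a minimal sketch that works on plain strings. It is an assumption-laden approximation, not the filters' real code: the actual filters operate on a token stream, the class and method names are hypothetical, and the exact regular expressions are guesses based on the descriptions above.

```java
import java.util.Collections;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical string-based sketch of two of the system filters. */
final class CanonicalizeSketch {

    // Three or more periods, with optional white space before, between,
    // and after them (an assumed reading of the filter description).
    private static final Pattern ELLIPSIS = Pattern.compile("\\s*\\.(?:\\s*\\.){2,}\\s*");
    private static final Pattern WHITE_SPACE = Pattern.compile("\\s+");

    /** Rewrites each ellipsis so every period has one space before and after it. */
    static String canonicalizeEllipses(String text) {
        Matcher m = ELLIPSIS.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Count the periods in this match so longer runs keep their length.
            int periods = (int) m.group().chars().filter(c -> c == '.').count();
            String spaced = " " + String.join(" ", Collections.nCopies(periods, ".")) + " ";
            m.appendReplacement(out, Matcher.quoteReplacement(spaced));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Collapses every run of white space characters to a single space. */
    static String canonicalizeWhiteSpace(String text) {
        return WHITE_SPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(canonicalizeEllipses("Wait...what"));
        System.out.println(canonicalizeWhiteSpace("one \t\n two"));
    }
}
```

In this sketch an ellipsis at the very end of the text keeps a trailing space; whether the real filter trims it is not stated in the reference.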
User Filters - These filters will generally go in your project-specific file.
- AllToTitleCase - Converts all letters to lower case, then capitalizes the first letter of the
first word of a phrase and of every subsequent word that is not an article, preposition, or
conjunction.
- AppendCoverArt FileName="" ArtistName="" [ProvenanceText=""] - Appends a reference to
cover art to the token stream when it sees END. The FileName is the name of the image to
place in the center of the cover. The artist name will be added as metadata
(dc:name="creator", opf:role="art"). If given, the ProvenanceText will be
included as text in the output.
Only one cover art image is supported; subsequent cover art tokens replace earlier ones.
- AppendFileCharacters FileName="" Encoding="" - Appends a file to the token stream when
it sees END. Encoding must be a Java character encoding name.
- AppendMarker Group="" Id="" - Appends a marker token to the token stream when it sees END.
Group is the group name, which should not contain spaces. Id must be an integer.
- AppendMetadata DcName="" Value="" [AttributeName="" AttributeValue=""] [Visible=""]
[Replace=""] - Appends a metadata token to the token stream
when it sees END. DcName is the name of the attribute, as defined in the
Dublin Core Metadata Initiative.
See also OPF Publication Metadata. An additional attribute (such as "opf:role") can be added
using AttributeName and AttributeValue.
Visible is a boolean; if true, the value will be included as text in the output; defaults to
false. Replace is a boolean; if true, the value will replace previous values for this name
rather than adding to the list; defaults to false.
- AppendParagraph - Appends a paragraph token to the token stream when it sees END.
- AppendSection SectionLevel="" [SectionName=""] - Appends a section token to the token
stream when it sees END. SectionName is optional; if not specified, the section will be unnamed.
SectionLevel is an integer indicating the number of indents, starting at 0.
- AppendText Text="" - Appends character tokens to the token stream when it sees END.
- Delete - Deletes all tokens (except markers).
- MarkRegexp RegExp="" [BeforeGroup=""] [AfterGroup=""] [FailIfNotFound="true"] - Searches for
strings that match a regular expression and inserts a mark before or after the match.
BeforeGroup is an optional group name; if given, it is used to mark where the match starts.
AfterGroup is an optional group name; if given, it is used to mark where the match ends.
At least one of BeforeGroup or AfterGroup must be specified.
FailIfNotFound is an optional boolean that defaults to true; if true and the regular expression
never matches anything in the stream, processing is aborted with an error. This is helpful for
detecting config file bugs.
- NumberUnnamedSections [StartCount="1"] SectionLevel="" SectionName="" - Assigns names and
potentially numbers to all unnamed sections in the token stream. StartCount is an optional
integer and defaults to 1; the section counter for assigned names will start with this number.
SectionLevel is an integer indicating the number of indents, starting at 0.
SectionName is a Java format string used to name the sections. Use "%d" to insert the current
counter value. Note that an unnamed section will get a default name by the output writers, but
a section with the name "" will be written as a small section break with no TOC entry.
- Region [FromMarker=""] [UntilMarker=""] [FailIfNotFound="true"] - Processes a region within
a stream with a different set of filters. Markers must be given in "group id" format. FromMarker
is optional and indicates the beginning of the region, inclusive; if not specified, the region
starts at the beginning of the stream. UntilMarker is optional and indicates the end of the
region, exclusive; if not specified, the region continues until the end of the stream.
FailIfNotFound is an optional boolean that defaults to true; if true and the region never occurs
in the stream, processing is aborted with an error. This is helpful for detecting config file
bugs.
- StringReplace RegExp="" Replacement="" [FailIfNotFound="true"] - Performs a running search
and replace. RegExp and Replacement are similar to standard Java regular expressions.
FailIfNotFound is an optional boolean that defaults to true; if true and the regular expression
never matches anything in the stream, processing is aborted with an error. This is helpful for
detecting config file bugs.
- WrapEpigraph - Wraps content in epigraph tokens.
- WrapFrontMatter - Wraps content in front matter tokens. Front matter comes before the TOC and
is not listed in the TOC.
- WrapSectionName SectionLevel="" - Converts its content into a named section.
SectionLevel is an integer indicating the number of indents, starting at 0.
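To make the AllToTitleCase and NumberUnnamedSections descriptions more concrete, here is a small sketch of both behaviours on plain strings. It rests on assumptions: the class and method names are hypothetical, the list of small words is a guess (the real filter's list is not given here), and unnamed sections are modelled as null entries in a list rather than as tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical string-based sketch of two of the user filters. */
final class UserFilterSketch {

    // Assumed list of articles, prepositions, and conjunctions.
    private static final Set<String> SMALL_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "and", "but", "or", "nor", "of", "in", "on", "to", "for", "at", "by"));

    /** Lower-cases everything, then capitalizes the first word and every non-small word. */
    static String toTitleCase(String phrase) {
        String[] words = phrase.toLowerCase(Locale.ROOT).split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            if (!w.isEmpty() && (i == 0 || !SMALL_WORDS.contains(w))) {
                w = Character.toUpperCase(w.charAt(0)) + w.substring(1);
            }
            if (i > 0) out.append(' ');
            out.append(w);
        }
        return out.toString();
    }

    /**
     * Names unnamed sections (modelled here as null) using a Java format string.
     * Sections that already have a name, including "", are left untouched.
     */
    static List<String> numberUnnamedSections(List<String> names, int startCount, String format) {
        List<String> result = new ArrayList<>();
        int counter = startCount;
        for (String name : names) {
            result.add(name == null ? String.format(format, counter++) : name);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toTitleCase("THE WIND IN THE WILLOWS"));
        System.out.println(numberUnnamedSections(
                Arrays.asList("Prologue", null, "", null), 1, "Chapter %d"));
    }
}
```

The second call leaves "Prologue" and the "" section alone and names the two unnamed sections "Chapter 1" and "Chapter 2", which mirrors the distinction the reference draws between an unnamed section and a section named "".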
Other Notes
Kindle width is 520. Kindle DX width is 745.
Calibre generated cover size is 590x740.
Nominal font size: Title: 48 Author: 32
The Result
Source:
https://latenighthacking.com:8443/svn/code_root/2011/TextMassager