I do a lot of reading on my tablet every day. While I've purchased a few eBooks, I enjoy discovering things for free on Project Gutenberg; it rekindles fond memories of browsing through the stacks of a library for something to catch my interest. There are various ways to search the PG website by title or author, but this presumes you have some idea of what you're looking for.
I've used the Magic Catalog, but I seem to have seen or read every book listed there that interests me, and as far as I can tell the catalog is about ten years old. In 2017 alone, PG added 2,423 books to its catalog, so perhaps 20,000 have been added over the last ten years.
From the Project Gutenberg website, you can link to the Offline Catalogs and download a plain-text list of all the books freely available, but the file is 6.6 MB, which is a little unwieldy. Even the list for 2017 alone is a bit tedious to scan. So I decided to make my own web page from this list, including links to each book (like the Magic Catalog), and turn that into an eBook. This turned out to be easier than you might expect. The trick is to use regex; specifically, regex as featured in Kwrite.
First, strip out the preamble text, which explains various details about Project Gutenberg. The listing begins after that:
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Dec 2017 to 31 Dec 2017 ~ ~ ~ ~

TITLE and AUTHOR ETEXT NO.

The Origin and Development of Christian Dogma, by Charles A. H. Tuthill 56279
 [Subtitle: An essay in the science of history]

Frank Merriwell's Endurance, by Burt L. Standish 56278
 [Subtitle: or A Square Shooter]

Derelicts, by James Sprunt 56277
 [Subtitle: An Account of Ships Lost at Sea in General Commercial
 Traffic and a Brief History of Blockade Runners Stranded Along
 the North Carolina Coast 1861-1865]

Comical Pilgrim; or, Travels of a Cynick Philosopher..., by Anonymous 56276
 [Subtitle: Thro' the most Wicked Parts of the World, Namely,
 England, Wales, Scotland, Ireland, and Holland]

I'r Aifft Ac Yn Ol, by D. Rhagfyr Jones 56275
 [Language: Welsh]
This shows the structure of the text file. The 5-digit number is the search term for each book; for example, the first book would be found here: http://www.gutenberg.org/ebooks/56279. Each book is separated from the next by an empty line.
To begin, download the file GUTINDEX.2017, load it into Kwrite, strip off the preamble, and Save As GUTINDEX.2017.xhtml, so the original stays unedited just in case. You might as well put in the XHTML preamble:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>GutIndex 2017</title>
</head>
<body>
Then at the bottom of the file:
</body>
</html>
I'm not a fan of the ~ ~ ~ ~ (four tildes separated by three spaces), so select Edit > Replace in Kwrite to bring up the Replace dialog at the bottom. You don't need to select Regular expression as the Mode for this step, but you'll need it later, so go ahead and do that.
In Find, enter ~ ~ ~ ~ and nothing in Replace. Click Replace All, and they all disappear, with the message: 24 replacements.
Now let's make the links. In Find, enter: (\d\d\d\d\d) (You must include the parentheses.)
In Replace, enter: <a href="http://www.gutenberg.org/ebooks/\1">\1</a>
This searches for a sequence of five digits and replaces it with the link markup, which includes the matched 5-digit number twice, denoted by \1. Now summon the courage to click Replace All (remember that you can undo this if you've made a mistake), and the magic happens: 2423 replacements. Here's a fragment:
The Origin and Development of Christian Dogma, by Charles A. H. Tuthill <a href="http://www.gutenberg.org/ebooks/56279">56279</a>
 [Subtitle: An essay in the science of history]

Frank Merriwell's Endurance, by Burt L. Standish <a href="http://www.gutenberg.org/ebooks/56278">56278</a>
 [Subtitle: or A Square Shooter]

Derelicts, by James Sprunt <a href="http://www.gutenberg.org/ebooks/56277">56277</a>
 [Subtitle: An Account of Ships Lost at Sea in General Commercial
 Traffic and a Brief History of Blockade Runners Stranded Along
 the North Carolina Coast 1861-1865]
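For readers who would rather script this step than click through the dialog, here is a minimal Python sketch of the same find-and-replace. It is not part of the Kwrite workflow; the function name add_links is made up for illustration, and the pattern and replacement text simply mirror the Find and Replace fields above.

```python
import re

# Same pattern as the Kwrite Find field: five digits, captured as group 1.
ETEXT_NUMBER = re.compile(r"(\d\d\d\d\d)")

# Same text as the Kwrite Replace field; \1 stands for the captured number.
LINK_TEMPLATE = r'<a href="http://www.gutenberg.org/ebooks/\1">\1</a>'


def add_links(text):
    """Wrap every five-digit run in a link; return (new_text, count)."""
    return ETEXT_NUMBER.subn(LINK_TEMPLATE, text)


if __name__ == "__main__":
    sample = "Derelicts, by James Sprunt 56277\n"
    linked, count = add_links(sample)
    print(count, "replacements")  # like Kwrite's feedback message
    print(linked, end="")
```

Like the Kwrite version, this matches any run of five digits anywhere in the text, so the replacement count is worth checking against the number of books you expect.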
Witness the power of regex! Now let's create paragraphs to separate these individual books, since whitespace and newlines mean nothing to HTML. Here is where we use that empty line between books. Before we do that, though, let's eliminate the lines that contain headings:
TITLE and AUTHOR ETEXT NO.
We're doing this because they're unnecessary, and the second heading isn't going to line up with the book number anyway. I wanted to get rid of this line and the extra newline characters, and since there were only 12, I went through the file manually. You can speed this up by using Edit > Find and searching for ETEXT.
Now more regex. In Find, enter: \n\n
In Replace, enter: </p>\n\n<p>
Then Replace All. I leave in the two newline characters so the text file is easier to read. You will need to manually add </p> at the end of the list. At the beginning, you'll see this:
Posting Dates for the below eBooks: 1 Dec 2017 to 31 Dec 2017 </p>

<p>The Origin and Development of Christian Dogma, by Charles A. H. Tuthill <a href="http://www.gutenberg.org/ebooks/56279">56279</a>
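The same paragraph step can be sketched in Python. This is only an illustration of the technique, with a made-up function name; it inserts the tags between entries and, like the Kwrite replacement, leaves the very first <p> and the final </p> for you to tidy by hand.

```python
import re


def wrap_paragraphs(text):
    """Turn each empty line into a paragraph boundary.

    Mirrors the Find (\\n\\n) and Replace (</p>\\n\\n<p>) fields used in
    Kwrite and returns (new_text, number_of_replacements).
    """
    return re.subn(r"\n\n", "</p>\n\n<p>", text)


if __name__ == "__main__":
    sample = "First Book, by A. Author 11111\n [Subtitle: x]\n\nSecond Book, by B. Author 22222\n"
    body, count = wrap_paragraphs(sample)
    print(count, "replacements")
    print(body, end="")
```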
I'd like to make the posting dates a header, but I also want to eliminate Posting Dates for the below eBooks: since simply showing the dates is enough. In Find, enter: Posting Dates for the below eBooks: and in Replace, enter: <h3> (or <h4>).
Now let's fix that trailing </p> for each header. You could do this manually, but if you're feeling lazy, enter 2017 </p> in Find, and 2017</h3> in Replace. With each of these, there's a slight risk of doing too much, but the feedback will tell you how many replacements there are (there should be 12). And you always have Undo.
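As a rough Python equivalent of the two header replacements, the sketch below uses plain string replacement, since no regex features are needed here. The function name is hypothetical, and the search strings assume the file looks as described above (tildes already removed, paragraph tags already added). Note that this sketch keeps the year in the replacement so the header still ends with 2017.

```python
def make_headers(text):
    """Turn each 'Posting Dates' line into an <h3> header.

    Returns (new_text, number_of_headers) so the count can be checked;
    a full year's index should give 12.
    """
    prefix = "Posting Dates for the below eBooks:"
    count = text.count(prefix)
    text = text.replace(prefix, "<h3>")
    # The header line ends with the year, a space, and the </p> that the
    # paragraph step added; swap that stray </p> for the closing </h3>.
    text = text.replace("2017 </p>", "2017</h3>")
    return text, count


if __name__ == "__main__":
    sample = "Posting Dates for the below eBooks: 1 Dec 2017 to 31 Dec 2017 </p>\n\n<p>Some Book 12345\n"
    result, count = make_headers(sample)
    print(count, "headers")
    print(result, end="")
```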
Now for some manual cleanup. Because you added the <p> and </p> tags, and because of the new <h3> tags, there will be extra paragraph tags and mismatched tags around these headers. You could simply scan the file at these points, or get some help by entering <h3> in the Find field, clicking Find All to highlight them, and scrolling down the file to get rid of any unneeded tags.
The other problem I found with XHTML was ampersands scattered throughout. Since XHTML is stricter than HTML, replace each & with &amp;. You may want to replace these individually using Replace instead of Replace All.
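If you script this step, one cautious approach is to escape only the ampersands that are not already the start of an entity, which plays the same role as replacing them one at a time. The following is a sketch under that assumption; the pattern and the function name are my own, not something Kwrite provides.

```python
import re

# An ampersand that is NOT already the start of an entity such as
# &amp; or &#38; or &#x26; (checked with a negative lookahead).
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_ampersands(text):
    """Replace bare ampersands with &amp;; return (new_text, count)."""
    return BARE_AMPERSAND.subn("&amp;", text)


if __name__ == "__main__":
    fixed, count = escape_ampersands("Law & Order, already escaped: &amp;")
    print(count, "replacements")
    print(fixed)
```

Because existing entities are skipped, running the function twice does not produce &amp;amp;.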
Some of the lines in the text file have some sort of control character that acts like &nbsp; (a non-breaking space). To fix this, highlight one in Kwrite (they show up as a faint baseline with a vertical bump), paste it into Find, and enter a space in Replace. This maintains visual spacing as text but is ignored as HTML (by the way, there were 12,586 of these in the document).
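In Python, the same cleanup might look like the sketch below. It assumes the odd character is the Unicode no-break space, U+00A0; I have not confirmed that this is what the file contains, so check a sample line before relying on it.

```python
# Assumption: the character Kwrite displays is U+00A0 (no-break space).
NBSP = "\u00a0"


def replace_nbsp(text):
    """Swap each no-break space for a plain space; return (new_text, count)."""
    count = text.count(NBSP)
    return text.replace(NBSP, " "), count


if __name__ == "__main__":
    cleaned, count = replace_nbsp("Derelicts,\u00a0by James Sprunt")
    print(count, "replacements")
    print(cleaned)
```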
Here's how it looks in a narrowed browser window:
Clicking a link takes you to the book's Project Gutenberg page, where you can view or download it.
I used Sigil to convert this to an eBook, which was probably the easiest part of the process. Start Sigil, then select "Add Existing Files" from the toolbar and select your XHTML or HTML file. To create a chapter for each month, scroll down to the monthly header line, place the cursor at the beginning of the line, then Split at Cursor (Ctrl + Return) to create 12 chapters. You can also use the headers to create a table of contents; it's also a good idea to edit the metadata to give it a title that will show up in your eBook reader (you can make yourself the author). Finally, save the file, and you're done.
Happy reading!