Downloading books from Project Gutenberg with Python

nlp
gutenberg
Published

January 18, 2024

Project Gutenberg is a great resource for free eBooks, and has lots of classic texts that are useful for NLP. There are some libraries for accessing Project Gutenberg from Python, such as py-gutenberg and GutenbergPy, but these require implicitly or explicitly building a database, which makes them complex to use. The R package gutenbergr is much easier to use because it distributes a snapshot of the catalog and loads it into memory, but I can’t find an equivalent in Python. So instead we’re going to search for books directly in Project Gutenberg’s CSV catalog export, and use it to download all the books of P. G. Wodehouse.

import csv
from io import BytesIO
from pathlib import Path

import requests

Reading the Catalog

Project Gutenberg doesn’t have an API but has documentation on offline catalogs. There exists a large RDF catalog (around 100MB compressed) with detailed metadata, and a smaller CSV catalog (14MB uncompressed) that contains limited metadata.

The CSV catalog is small enough that we can quickly download it into memory (note that requests decompresses the gzip-encoded response automatically):

import requests

GUTENBERG_CSV_URL = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv.gz"

r = requests.get(GUTENBERG_CSV_URL)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
csv_text = r.content.decode("utf-8")

f"Total size: {len(r.content) / 1024**2:0.2f}MB"
'Total size: 14.04MB'
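
If the server ever returned the raw .gz bytes instead (a case not seen in this post), decoding them as UTF-8 would fail. A minimal sketch of a guard, using a hypothetical helper `maybe_gunzip` that checks for the gzip magic number before decompressing:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress data if it looks gzip-compressed, else return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Small self-contained check with a made-up two-line CSV
sample = "Text#,Type\n1,Text\n".encode("utf-8")
print(maybe_gunzip(gzip.compress(sample)) == sample)  # compressed input is unpacked
print(maybe_gunzip(sample) == sample)                 # plain input passes through
```

With the response above, this would be used as `maybe_gunzip(r.content).decode("utf-8")`.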

The text is a standard CSV file:

print(csv_text[:400])
Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,1971-12-01,The Declaration of Independence of the United States of America,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence",E201; JK,Politics; American Revolutionary War; United States Law
2,Text,1972-12-01,"The United States Bill o

An easy way to process it is with a DictReader, wrapping the text in StringIO to make it look like a file:

import csv
from io import StringIO

next(csv.DictReader(StringIO(csv_text)))
{'Text#': '1',
 'Type': 'Text',
 'Issued': '1971-12-01',
 'Title': 'The Declaration of Independence of the United States of America',
 'Language': 'en',
 'Authors': 'Jefferson, Thomas, 1743-1826',
 'Subjects': 'United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence',
 'LoCC': 'E201; JK',
 'Bookshelves': 'Politics; American Revolutionary War; United States Law'}

We can then search for all P. G. Wodehouse books by looking for authors containing “Wodehouse”:

wodehouse_books = [book for book in csv.DictReader(StringIO(csv_text)) 
                   if 'Wodehouse' in book['Authors']]

len(wodehouse_books)
56

Let’s show our results in an HTML table (it’s a bit long - feel free to skim past it):

from html import escape

from IPython.display import display, HTML

def dicts_to_html_table(dicts):
    html = ["<table>"]
    header = None
    for d in dicts:
        if header is None:
            # Take the column names from the first row
            header = list(d.keys())
            html.append("<tr>" +
                        "".join([f"<th>{escape(h)}</th>" for h in header]) +
                        "</tr>")
        # Escape the values so characters like "&" in titles render correctly
        html.append("<tr>" +
                    "".join([f"<td>{escape(str(d[h]))}</td>" for h in header]) +
                    "</tr>")
    html.append("</table>")

    return "".join(html)

display(HTML(dicts_to_html_table(wodehouse_books)))
Text# Type Issued Title Language Authors Subjects LoCC Bookshelves
2005 Text 1999-12-01 Piccadilly Jim en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Piccadilly (London, England) -- Fiction PR Best Books Ever Listings; Humor
2042 Text 2000-01-01 Something New en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction PR Best Books Ever Listings; Humor
2233 Text 2000-06-01 A Damsel in Distress en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction PR Humor
2607 Text 2001-04-01 Psmith, Journalist en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR Humor
3756 Text 2008-06-25 Indiscretions of Archie en Wodehouse, P. G. (Pelham Grenville), 1881-1975 New York (N.Y.) -- Fiction; Humorous stories; British -- United States -- Fiction; World War, 1914-1918 -- Veterans -- Fiction; Hotels -- Fiction; Married men -- Fiction; Fathers-in-law -- Fiction PR Humor
3829 Text 2003-03-01 Love Among the Chickens en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction PR Humor
4075 Text 2003-05-01 The Intrusion of Jimmy en Wodehouse, P. G. (Pelham Grenville), 1881-1975 New York (N.Y.) -- Fiction; Humorous stories; Love stories; Burglary -- Fiction; British -- United States -- Fiction; Police -- Family relationships -- Fiction PR Humor
6683 Text 2004-10-01 The Little Nugget en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Kidnapping -- Fiction PR Humor
6684 Text 2004-10-01 Uneasy Money en Wodehouse, P. G. (Pelham Grenville), 1881-1975 New York (N.Y.) -- Fiction; Humorous stories; Inheritance and succession -- Fiction; Love stories; Aristocracy (Social class) -- Fiction; British -- United States -- Fiction PR Humor
6753 Text 2004-10-01 Psmith in the City en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR; PZ Humor
6768 Text 2004-10-01 The Man Upstairs and Other Stories en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories, English PR Humor
6836 Text 2004-11-01 Three Men and a Maid en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR Humor
6837 Text 2004-11-01 The Little Warrior en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Love stories; Poor women -- Fiction; Musicals -- Fiction; Broadway (New York, N.Y.) -- Fiction; Long Island (N.Y.) -- Fiction PR Humor
6877 Text 2004-11-01 The Head of Kay's en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Schools -- Fiction PR; PZ Humor; School Stories
6879 Text 2004-11-01 The Gold Bat en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories, English; Boys -- Conduct of life -- Juvenile fiction; Schools -- Juvenile fiction; Sports -- Juvenile fiction PR; PZ Humor; School Stories
6880 Text 2004-11-01 The Coming of Bill en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR Humor
6927 Text 2004-11-01 The White Feather en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Schools -- Fiction; Children's stories PR; PZ Humor; School Stories
6955 Text 2004-11-01 The Prince and Betty en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR Humor
6980 Text 2004-11-01 Tales of St. Austin's en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Schools -- Fiction; Humorous stories PR; PZ Humor; School Stories
6984 Text 2004-11-01 The Pothunters en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Schools -- Fiction; Humorous stories; Theft -- Fiction; England -- Social life and customs -- 20th century -- Fiction PR; PZ Humor; School Stories
6985 Text 2004-11-01 A Prefect's Uncle en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Schools -- Fiction; Humorous stories; Uncles -- Fiction; Schoolboys -- Fiction; Cricket stories PR; PZ Humor; School Stories
7028 Text 2004-12-01 The Clicking of Cuthbert en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Golf stories PR Humor
7050 Text 2004-12-01 The Swoop! or, How Clarence Saved England: A Tale of the Great Invasion en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Satire; Humorous stories; England -- Fiction; Boy Scouts -- Fiction; Imaginary wars and battles -- Fiction PR Humor; Scouts
7230 Text 2005-01-01 Not George Washington — an Autobiographical Novel en Westbrook, H. W. (Herbert Wetton); Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR Humor
7298 Text 2005-01-01 William Tell Told Again en Wodehouse, P. G. (Pelham Grenville), 1881-1975 English wit and humor; Tell, Wilhelm -- Fiction PR; PZ Humor
7423 Text 2005-02-01 Mike en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Boarding schools -- Fiction; Schools -- Fiction; Humorous stories; England -- Fiction; Cricket -- Fiction PR; PZ Humor; School Stories
7464 Text 2005-02-01 The Adventures of Sally en Wodehouse, P. G. (Pelham Grenville), 1881-1975 New York (N.Y.) -- Fiction; Humorous stories; Inheritance and succession -- Fiction PR Humor
7471 Text 2005-02-01 The Man with Two Left Feet, and Other Stories en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Short stories; Humorous stories, English PR Humor
8164 Text 2005-05-01 My Man Jeeves en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction PR Humor
8176 Text 2005-05-01 Death at the Excelsior, and Other Stories en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories, English; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Valets -- Fiction PR Humor
8178 Text 2005-05-01 The Politeness of Princes, and Other School Stories en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Schools -- Fiction PR; PZ Humor; School Stories
8190 Text 2005-05-01 A Wodehouse Miscellany: Articles & Stories en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories, English; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Valets -- Fiction; England -- Social life and customs -- Fiction PR Humor
8713 Text 2005-08-01 A Man of Means en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR Humor
8931 Text 2005-09-01 The Gem Collector en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Jewel thieves -- Fiction PR Humor
10554 Text 2004-01-01 Right Ho, Jeeves en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction PR Humor
10586 Text 2004-01-01 Mike and Psmith en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Boarding schools -- Fiction; Schools -- Fiction; Humorous stories; England -- Fiction; Cricket -- Fiction PZ Humor; School Stories
20532 Text 2007-02-06 Love Among the Chickens A Story of the Haps and Mishaps on an English Chicken Farm en Wodehouse, P. G. (Pelham Grenville), 1881-1975; Both, Armand, 1881-1922 [Illustrator] Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction PR Humor
20533 Text 2007-02-06 Jill the Reckless en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Love stories; Poor women -- Fiction; Musicals -- Fiction; Broadway (New York, N.Y.) -- Fiction; Long Island (N.Y.) -- Fiction PR Humor
20717 Text 2007-03-01 The Girl on the Boat en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Children of the rich -- Fiction; Golf stories; Man-woman relationships -- Fiction PR
23899 Sound 2007-12-01 Psmith in the City en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories PR; PZ
26303 Sound 2008-08-01 Right Ho, Jeeves en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction PR
26579 Sound 2008-09-01 Love Among the Chickens en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction PR
43317 Text 2013-07-26 Lord Lyons: A Record of British Diplomacy, Vol. 1 of 2 en Newton, Thomas Wodehouse Legh, Baron, 1857-1942 Europe -- Politics and government -- 1871-1918; Lyons, Richard Bickerton Pemell Lyons, Earl, 1817-1887; Diplomatic and consular service, British; Great Britain -- Foreign relations -- 1837-1901 DA
44143 Text 2013-11-10 Lord Lyons: A Record of British Diplomacy, Vol. 2 of 2 en Newton, Thomas Wodehouse Legh, Baron, 1857-1942; Ward, Wilfrid, Mrs., 1864-1932 [Contributor] Europe -- Politics and government -- 1871-1918; Lyons, Richard Bickerton Pemell Lyons, Earl, 1817-1887; Diplomatic and consular service, British; Great Britain -- Foreign relations -- 1837-1901 DA
58508 Text 2018-12-21 Index of the Project Gutenberg Works of Pelham Grenville Wodehouse en Wodehouse, P. G. (Pelham Grenville), 1881-1975; Widger, David, 1932-2021? [Editor] Indexes PR
59254 Text 2019-04-11 The Inimitable Jeeves en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction; England -- Social life and customs -- Fiction; Upper class -- England -- Fiction PR
60067 Text 2019-08-06 Leave it to Psmith en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Impostors and imposture -- Fiction; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction; Jewel thieves -- Fiction PR
61507 Text 2020-02-25 Ukridge en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction PR
63735 Text 2020-11-13 Subscription the disgrace of the English Church [2nd edition] en Wodehouse, C. N. (Charles Nourse), 1790-1870 Church of England -- Controversial literature; Church of England. Thirty-nine Articles BX
63738 Text 2020-11-13 Subscription the disgrace of the English Church [1st edition] en Wodehouse, C. N. (Charles Nourse), 1790-1870 Church of England -- Controversial literature; Church of England. Thirty-nine Articles BX
65172 Text 2021-04-26 A Gentleman of Leisure en Wodehouse, P. G. (Pelham Grenville), 1881-1975 New York (N.Y.) -- Fiction; Humorous stories; Love stories; Burglary -- Fiction; British -- United States -- Fiction; Police -- Family relationships -- Fiction PR
65974 Text 2021-08-01 Carry On, Jeeves en Wodehouse, P. G. (Pelham Grenville), 1881-1975 British -- New York (State) -- New York -- Fiction; Short stories; Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction PR
67368 Text 2022-02-10 Sam in the Suburbs en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories; England -- Fiction; Gangsters -- Fiction; Publishers and publishing -- Fiction PR
70041 Text 2023-02-14 The small bachelor en Wodehouse, P. G. (Pelham Grenville), 1881-1975 New York (N.Y.) -- Fiction; Humorous stories; Man-woman relationships -- Fiction; Upper class -- Fiction; Painters -- Fiction PR
70222 Text 2023-03-06 Meet Mr Mulliner en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories, English; England -- Fiction; Short stories, English; San Francisco (Calif.) -- Fiction; Interpersonal relations -- Fiction; Mulliner family (Fictitious characters) -- Fiction PR
72227 Text 2023-11-25 Divots en Wodehouse, P. G. (Pelham Grenville), 1881-1975 Humorous stories, English; Golfers -- Fiction; Golf stories, English PR

There are a few kinds of results above that aren’t what I’m looking for:

  • Other authors with the name Wodehouse: Wodehouse, C. N. and Thomas Wodehouse Legh
  • The “Index of the Project Gutenberg Works of Pelham Grenville Wodehouse”
  • Some entries have the type “Sound” rather than “Text”

We can filter these out to get just the books we need.

wodehouse_books = [b for b in wodehouse_books
                   if "Wodehouse, P. G." in b["Authors"]
                   and "Indexes" not in b["Subjects"]
                   and b["Type"] == "Text"]
len(wodehouse_books)
48
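
The substring test above works for this author, but the Authors field packs several people and roles into one string. A minimal sketch of splitting it apart, assuming (based only on the rows shown above) that entries are separated by ";" and that a role such as [Illustrator] or [Editor] may trail the name. The helper names `parse_authors` and `is_written_by`, and the default role "Author" for unmarked entries, are my own choices:

```python
import re


def parse_authors(authors: str) -> list[tuple[str, str]]:
    """Split a catalog 'Authors' field into (name, role) pairs."""
    result = []
    for entry in authors.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # A trailing "[Role]" marks contributors who are not the main author
        match = re.search(r"\s*\[([^\]]+)\]$", entry)
        if match:
            result.append((entry[:match.start()], match.group(1)))
        else:
            result.append((entry, "Author"))  # assumed default role
    return result


def is_written_by(authors: str, name_prefix: str) -> bool:
    """True if some entry with the assumed 'Author' role starts with name_prefix."""
    return any(name.startswith(name_prefix) and role == "Author"
               for name, role in parse_authors(authors))


print(parse_authors("Wodehouse, P. G. (Pelham Grenville), 1881-1975; "
                    "Both, Armand, 1881-1922 [Illustrator]"))
print(is_written_by("Newton, Thomas Wodehouse Legh, Baron, 1857-1942",
                    "Wodehouse, P. G."))
```

This would still keep the index volume, since Wodehouse is listed as its author, so the Subjects and Type filters remain necessary.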

Downloading the text

Once we have the id of a book (the Text# column), we can download it from a standard URL. For human access, the text is at https://www.gutenberg.org/ebooks/{id}.txt.utf-8:

GUTENBERG_TEXT_URL = "https://www.gutenberg.org/ebooks/{id}.txt.utf-8"

book_id = wodehouse_books[0]["Text#"]

#r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
#text = r.text

However, Project Gutenberg’s robot access policy suggests that automated tools use a special harvest URL that lists links to the files (we set filetypes to txt here to get plain text):

GUTENBERG_ROBOT_URL = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
r = requests.get(GUTENBERG_ROBOT_URL)

print(r.text[:750])
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>All Files (offset: 0, filetypes: txt) - Project Gutenberg</title>
  </head>
  <body>
    <h1>All Files (offset: 0, filetypes: txt)</h1>    <p><a href="http://aleph.gutenberg.org/etext02/comed10.zip">http://aleph.gutenberg.org/etext02/comed10.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip</a></p>

    <p><a href="http://aleph.guten

The mirror can be extracted from the URLs:

import re

GUTENBERG_MIRROR = re.search(r'(https?://[^/]+)[^"]*\.zip', r.text).group(1)
GUTENBERG_MIRROR
'http://aleph.gutenberg.org'
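
As an alternative sketch, the host can be taken from the first .zip link with urllib.parse instead of a single regex. The helper name `extract_mirror` is hypothetical, and the sample HTML is one line copied from the harvest page output above:

```python
import re
from urllib.parse import urlsplit


def extract_mirror(harvest_html: str) -> str:
    """Return scheme://host of the first .zip link in the harvest page."""
    match = re.search(r'href="([^"]+\.zip)"', harvest_html)
    if match is None:
        raise ValueError("No .zip links found in harvest page")
    parts = urlsplit(match.group(1))
    return f"{parts.scheme}://{parts.netloc}"


sample = ('<p><a href="http://aleph.gutenberg.org/etext02/comed10.zip">'
          'http://aleph.gutenberg.org/etext02/comed10.zip</a></p>')
print(extract_mirror(sample))  # http://aleph.gutenberg.org
```

Raising an error when no link is found gives a clearer message than the AttributeError that `.group(1)` on a failed search would produce.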

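Then we can construct the URL using the same logic as gutenbergr: the directory path is made of every digit of the id except the last, followed by the id itself (single-digit ids live under 0). Note that sometimes we need to add a suffix (e.g. look at http://aleph.gutenberg.org/0/1/, which only has a -0 version):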

def gutenberg_text_urls(id: str, mirror=GUTENBERG_MIRROR, suffixes=("", "-8", "-0")) -> list[str]:
    path = "/".join(id[:-1]) or "0"
    return [f"{mirror}/{path}/{id}/{id}{suffix}.zip" for suffix in suffixes]

gutenberg_text_urls(book_id)
['http://aleph.gutenberg.org/2/0/0/2005/2005.zip',
 'http://aleph.gutenberg.org/2/0/0/2005/2005-8.zip',
 'http://aleph.gutenberg.org/2/0/0/2005/2005-0.zip']
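
To make the path rule concrete, here is the same directory logic on its own, applied to ids of different lengths. The function name `gutenberg_path` is just for illustration:

```python
def gutenberg_path(book_id: str) -> str:
    """Directory for a book: all digits but the last, then the id itself."""
    # "/".join("") is empty for single-digit ids, which fall back to "0"
    directory = "/".join(book_id[:-1]) or "0"
    return f"{directory}/{book_id}"


for book_id in ["1", "12", "2005", "70041"]:
    print(book_id, "->", gutenberg_path(book_id))
# 1 -> 0/1
# 12 -> 1/12
# 2005 -> 2/0/0/2005
# 70041 -> 7/0/0/4/70041
```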

We can then try each URL in turn until we find the file, and then unzip it:

import logging
import zipfile

def download_gutenberg(id: str) -> str:
    for url in gutenberg_text_urls(id):
        r = requests.get(url)
        if r.status_code == 404:
            logging.warning(f"404 for {url}")
            continue
        r.raise_for_status()
        break
    else:
        # Every candidate URL returned 404, so there is nothing to unzip
        raise FileNotFoundError(f"No text zip found for book {id}")

    z = zipfile.ZipFile(BytesIO(r.content))

    if len(z.namelist()) != 1:
        raise Exception(f"Expected 1 file in {z.namelist()}")

    # Assumes the file inside the zip is UTF-8 (or ASCII) encoded
    return z.read(z.namelist()[0]).decode('utf-8')

text = download_gutenberg(book_id)

print(text[:1500])
The Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Piccadilly Jim

Author: P. G. Wodehouse

Release Date: September 12, 2012 [EBook #2005]
Last Updated: August 16, 2016

Language: English

Character set encoding: ASCII

*** START OF THIS PROJECT GUTENBERG EBOOK PICCADILLY JIM ***




Produced by Jim Tinsley









Piccadilly Jim


by

Pelham Grenville Wodehouse





CHAPTER I

A RED-HAIRED GIRL

The residence of Mr. Peter Pett, the well-known financier, on
Riverside Drive is one of the leading eyesores of that breezy and
expensive boulevard. As you pass by in your limousine, or while
enjoying ten cents worth of fresh air on top of a green omnibus,
it jumps out and bites at you. Architects, confronted with it,
reel and throw up their hands defensively, and even the lay
observer has a sense of shock. The place resembles in almost
equal proportions a cathedral, a suburban villa, a hotel and a
Chinese pagoda. Many of its windows are of stained glass, and
above the porch stand two terra-cotta lions, considerably more
repulsive even than the complacent animals which guard New York's
Public Library. It is a house which is impossible to overlook:
and it
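
The download function decodes the zipped file as UTF-8, which works for this ASCII book. My understanding is that the "-8" files use an 8-bit encoding (typically ISO-8859-1), so decoding one of those as UTF-8 could fail. Treat that as an assumption rather than something verified here. A minimal sketch of a fallback, with a hypothetical helper `decode_book_bytes`:

```python
def decode_book_bytes(data: bytes) -> str:
    """Decode book bytes as UTF-8, falling back to Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but it may produce wrong characters if the real encoding differs
        return data.decode("latin-1")


print(decode_book_bytes("café".encode("utf-8")))
print(decode_book_bytes("café".encode("latin-1")))
```

Both calls print the same word, because the second input is not valid UTF-8 and takes the fallback path.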

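The header ends with a line containing “PROJECT GUTENBERG EBOOK”. Searching for this text, we can see it also appears near the end of the file, where it marks the start of the license footer (this book actually has some transcriber’s notes after the end of the story, but we’ll leave them in):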

GUTENBERG_TEXT = "PROJECT GUTENBERG EBOOK "

lines = text.splitlines()

first = True
for idx, line in enumerate(lines):
    if GUTENBERG_TEXT in line:
        if first:
            first = False
            continue
        print('=' * 80)
        print('\n'.join(lines[idx-20:idx+20]))
        print('=' * 80)
        print()
================================================================================

This is a somewhat clumsy construction, and quite un-Wodehousian.
The original passage in the serialization read:

 "Before his stony eye the immaculate Bartling wilted. All that
 he had ever heard and read about doubles came to him."

--------------------------------










End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse

*** END OF THIS PROJECT GUTENBERG EBOOK PICCADILLY JIM ***

***** This file should be named 2005.txt or 2005.zip *****
This and all associated files of various formats will be found in:
        http://www.gutenberg.org/2/0/0/2005/

Produced by Jim Tinsley

Updated editions will replace the previous one--the old editions
will be renamed.

Creating the works from public domain print editions means that no
one owns a United States copyright in these works, so the Foundation
(and you!) can copy and distribute it in the United States without
permission and without paying copyright royalties.  Special rules,
set forth in the General Terms of Use part of this license, apply to
copying and distributing Project Gutenberg-tm electronic works to
protect the PROJECT GUTENBERG-tm concept and trademark.  Project
Gutenberg is a registered trademark, and may not be used if you
charge for the eBooks, unless you receive specific permission.  If you
================================================================================

We can read everything between the first and second marker lines with a simple state machine:

def strip_headers(text):
    in_text = False
    output = []
    
    for line in text.splitlines():        
        if GUTENBERG_TEXT in line:
            if not in_text:
                in_text = True
            else:
                break
        else:
            if in_text:
                output.append(line)

    return "\n".join(output).strip()

stripped_text = strip_headers(text)

And check that it has worked:

print(stripped_text[:200])
print("*" * 80)
print(stripped_text[-500:])
Produced by Jim Tinsley









Piccadilly Jim


by

Pelham Grenville Wodehouse





CHAPTER I

A RED-HAIRED GIRL

The residence of Mr. Peter Pett, the well-known financier, on
Riverside Drive is one
********************************************************************************
ling wilted.
 It was a perfectly astounding likeness, but it was
 apparent to him when what he had ever heard and read
 about doubles came to him."

This is a somewhat clumsy construction, and quite un-Wodehousian.
The original passage in the serialization read:

 "Before his stony eye the immaculate Bartling wilted. All that
 he had ever heard and read about doubles came to him."

--------------------------------










End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse
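
The tail of the output still ends with the “End of the Project Gutenberg EBook of ...” line, because its mixed-case spelling doesn’t match the upper-case marker. A minimal sketch for trimming it, assuming the line always starts with that phrase (other books may word it differently). The helper name `drop_trailing_end_line` is hypothetical:

```python
END_LINE_PREFIX = "end of the project gutenberg ebook"  # assumed wording


def drop_trailing_end_line(text: str) -> str:
    """Remove a final 'End of the Project Gutenberg EBook ...' line, if present."""
    lines = text.rstrip().splitlines()
    if lines and lines[-1].strip().lower().startswith(END_LINE_PREFIX):
        lines.pop()
    return "\n".join(lines).rstrip()


sample = ("Story text.\n\n\n"
          "End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse")
print(repr(drop_trailing_end_line(sample)))  # 'Story text.'
```

Text that doesn’t end with such a line is returned unchanged apart from trailing whitespace.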

Downloading all the files

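Now we can download all the files in a loop. First, let’s create a function that gets and cleans the text:

def book_text(book_id):
    # Note: this fetches from the human-facing URL; download_gutenberg(book_id)
    # is the alternative that goes through the robot mirror
    r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
    r.raise_for_status()
    text = r.text
    clean_text = strip_headers(text)
    return clean_text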

We’ll save each book into the “data” folder:

data_path = Path("data")
data_path.mkdir(exist_ok=True)

And finally save all the books (one at a time to not overload the server):

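for book in wodehouse_books:
    book_id = book["Text#"]
    text = book_text(book_id)
    print(f"Saving {book['Title']} by {book['Authors']} containing {len(text):_} characters")
    # Write as UTF-8 explicitly so the result doesn't depend on the platform default
    with open(data_path / (book_id + ".txt"), "wt", encoding="utf-8") as f:
        f.write(text)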
Saving Piccadilly Jim by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 449_842 characters
Saving Something New by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 419_221 characters
Saving A Damsel in Distress by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 429_025 characters
Saving Psmith, Journalist by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 322_135 characters
Saving Indiscretions of Archie by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 413_041 characters
Saving Love Among the Chickens by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 278_160 characters
Saving The Intrusion of Jimmy by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 381_406 characters
Saving The Little Nugget by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 388_673 characters
Saving Uneasy Money by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 364_858 characters
Saving Psmith in the City by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 295_582 characters
Saving The Man Upstairs and Other Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 562_134 characters
Saving Three Men and a Maid by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 307_343 characters
Saving The Little Warrior by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 650_963 characters
Saving The Head of Kay's by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 245_760 characters
Saving The Gold Bat by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 240_786 characters
Saving The Coming of Bill by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 466_145 characters
Saving The White Feather by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 249_059 characters
Saving The Prince and Betty by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 379_548 characters
Saving Tales of St. Austin's by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 283_438 characters
Saving The Pothunters by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 228_580 characters
Saving A Prefect's Uncle by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 236_385 characters
Saving The Clicking of Cuthbert by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 341_920 characters
Saving The Swoop! or, How Clarence Saved England: A Tale of the Great Invasion by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 102_301 characters
Saving Not George Washington — an Autobiographical Novel by Westbrook, H. W. (Herbert Wetton); Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 289_260 characters
Saving William Tell Told Again by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 83_002 characters
Saving Mike by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 596_796 characters
Saving The Adventures of Sally by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 433_963 characters
Saving The Man with Two Left Feet, and Other Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 383_878 characters
Saving My Man Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 272_292 characters
Saving Death at the Excelsior, and Other Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 208_157 characters
Saving The Politeness of Princes, and Other School Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 131_705 characters
Saving A Wodehouse Miscellany: Articles & Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 164_155 characters
Saving A Man of Means by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 154_068 characters
Saving The Gem Collector by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 180_113 characters
Saving Right Ho, Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 405_146 characters
Saving Mike and Psmith by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 301_734 characters
Saving Love Among the Chickens
A Story of the Haps and Mishaps on an English Chicken Farm by Wodehouse, P. G. (Pelham Grenville), 1881-1975; Both, Armand, 1881-1922 [Illustrator] containing 272_012 characters
Saving Jill the Reckless by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 633_811 characters
Saving The Girl on the Boat by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 379_578 characters
Saving The Inimitable Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 374_608 characters
Saving Leave it to Psmith by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 501_634 characters
Saving Ukridge by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 423_246 characters
Saving A Gentleman of Leisure by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 382_804 characters
Saving Carry On, Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 406_254 characters
Saving Sam in the Suburbs by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 448_006 characters
Saving The small bachelor by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 413_161 characters
Saving Meet Mr Mulliner by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 302_837 characters
Saving Divots by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 377_621 characters
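
To be gentler on the server, the loop could also pause between requests and skip books that are already on disk. A minimal sketch, where `save_books` is a hypothetical helper, the fetch function is passed in (so something like book_text can be plugged in), and the 2 second default pause is an arbitrary choice rather than a documented requirement:

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def save_books(books: Iterable[dict], fetch: Callable[[str], str],
               out_dir: Path, delay: float = 2.0) -> list[Path]:
    """Fetch and save each book once, sleeping between downloads."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for book in books:
        book_id = book["Text#"]
        target = out_dir / f"{book_id}.txt"
        if target.exists():
            continue  # already downloaded on a previous run
        target.write_text(fetch(book_id), encoding="utf-8")
        saved.append(target)
        time.sleep(delay)  # pause so requests are spread out
    return saved
```

A call along the lines of `save_books(wodehouse_books, book_text, data_path)` would then make reruns cheap, since only missing files are fetched.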

We can also save our metadata for future reference:

# newline='' lets the csv module handle line endings itself
with open(data_path / 'metadata.csv', 'wt', newline='', encoding='utf-8') as f:
    csv_writer = csv.DictWriter(f, fieldnames=wodehouse_books[0].keys())
    csv_writer.writeheader()
    csv_writer.writerows(wodehouse_books)
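
As a quick check that the metadata survives a round trip, here is a self-contained sketch that writes two made-up rows to a temporary file and reads them back. One title contains a line break, as the title of book 20532 appears to in the output above:

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"Text#": "2005", "Title": "Piccadilly Jim"},
    {"Text#": "20532", "Title": "Love Among the Chickens\nA Story of the Haps "
                                "and Mishaps on an English Chicken Farm"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.csv"
    # newline="" on both write and read keeps embedded line breaks intact
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

print(loaded == rows)  # True
```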

Conclusion

It’s really simple to search for books using the Project Gutenberg CSV catalog, and to download the books in a way that complies with their robots and crawlers guidelines (thanks to gutenbergr for showing the way). You can easily get books from Project Gutenberg for further data analysis or machine learning; I’m going to train a language model on P. G. Wodehouse.