pgextract - new app
#1
Just a little CLI app to grab the book title and author from a Project Gutenberg UTF-8 text file and write them to attributes of the file. By itself this won't be of much interest, but it is part of an ebook reader project of mine.

Usage:
pgextract_en path/to/file.txt
Asks for confirmation in the Terminal before writing the attributes.

pgextract_en --noconfirm path/to/file.txt
Skips confirmation. For use in batch conversions. This app will only accept a single filename, but you can use it inside a for loop in a shell script.

pgextract_en --confirmGUI path/to/file.txt
Puts the confirmation process in a graphical Alert. Haven't quite figured out what that would be good for yet, but who knows?

pgextract_en --help OR pgextract_en -h
Shows help.

path/to/file.txt cannot contain spaces. Maybe in the next version, but Project Gutenberg files have names like pg12345.txt anyway.
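
For the batch case, a loop along these lines should do. This is only a sketch: the batch_extract function name, the PGEXTRACT override variable and the example directory are made up for illustration, and it assumes pgextract_en is on your PATH. It skips any path containing a space, since the app cannot handle those.

```shell
#!/bin/sh
# Run the extractor on every .txt file in a directory, one file per call.
# PGEXTRACT (hypothetical) lets you point at another binary; default is pgextract_en.
batch_extract() {
    dir=$1
    for f in "$dir"/*.txt; do
        # Skip the unexpanded pattern when the directory has no .txt files.
        [ -e "$f" ] || continue
        # The app cannot handle spaces in the path, so leave those files alone.
        case $f in
            *" "*) echo "Skipping (space in path): $f" >&2; continue ;;
        esac
        "${PGEXTRACT:-pgextract_en}" --noconfirm "$f"
    done
}

# Example call (hypothetical directory):
# batch_extract /boot/home/ebooks
```

Setting PGEXTRACT=echo first gives a dry run that only prints the commands that would be executed.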

I discovered that PG files have some nasty embedded codes in the beginning of the file, otherwise more straightforward approaches would have been possible. This code does require some clean-up - too many exit points, for one thing.

Code:
#!/bin/env yab

doc pgextract_en v0.1
doc Extract author and title data from a Project Gutenberg text file,
doc and write these to attributes.
doc (c) Michel Clasquin-Johnson, 2016, Public Domain
doc
doc Usage:
doc   pgextract_en [--noconfirm | --confirmGUI] <path/to/file>
doc   pgextract_en --help | -h
doc
doc The default behaviour is to ask for confirmation in text mode before
doc writing the attributes. The --noconfirm switch skips this step. The
doc --confirmGUI switch puts the confirmation in a Haiku two-button alert.
doc These switches are INCOMPATIBLE! All switches are case-insensitive.
doc
doc Pathnames should NOT contain spaces. One file at a time, please!
doc
doc This will only work with English-language files, since it searches for
doc the strings  "The Project Gutenberg EBook of " and ", by". I may write
doc versions for other languages if necessary.
doc

fulltitle$=""
title$=""
author$=""
noconfirm =0
thefile$ = peek$("argument")
if lower$(thefile$) = "--help" or lower$(thefile$) = "-h" showhelp()
if lower$(thefile$) = "--noconfirm" then
    noconfirm =1
    thefile$ = peek$("argument")
elseif lower$(thefile$) = "--confirmgui" then
    noconfirm =-1
    thefile$ = peek$("argument")
endif
if thefile$ = "" exit
firstline$ = system$("head -n 1 " + thefile$)
print "Processing " + thefile$
print "First line: "
print firstline$
print "Parsing ..."
parse()
switch noconfirm
    case -1    //GUI confirmation
        a$ = "Full Title: " + fulltitle$ + ".\n"
        a$ = a$ + "Title: " + title$ + ".\n"
        a$ = a$ + "Author: " + author$ + ".\n\n"
        a = ALERT a$ + "Write these attributes to " + thefile$ + "?", "Yes", "", "No", "warning"
        if a = 1 writeattribs()
    break
    case 0    //CLI confirmation
        print "Full entry: " + fulltitle$
        print "Title: " + title$
        print "Author: " + author$
        input "Write these attributes to the file? (y/n) " a$
        if lower$(left$(a$,1)) = "y" writeattribs()
    break
    case 1    // no confirmation - for automated bulk operations
            //requires the --noconfirm switch
        writeattribs()
    break
    default
    break
end switch

exit

sub writeattribs()
    print
    print "Setting attribute ebook:full_title to " + fulltitle$ + "."
    attribute set "String", "ebook:full_title", fulltitle$, thefile$
    print "Setting attribute ebook:title to " + title$ + "."
    attribute set "String", "ebook:title", title$, thefile$
    print "Setting attribute ebook:author to " + author$ + "."
    attribute set "String", "ebook:author", author$, thefile$
end sub

sub showhelp()
    for a=1 to arraysize(docu$(),1)
        print docu$(a)
    next a
    exit
end sub

sub parse()
    local without_asterixes$, character$, postitle, posauthor, search1$, search2$
    //change the following 2 lines for books in other languages
    search1$ = "The Project Gutenberg EBook of "
    search2$ = ", by "
    //some PG files have asterisks in them. Replace these with spaces
    //then remove them later with trim$
    for f = 1 to len(firstline$)
        character$ = mid$(firstline$, f,1)
        if character$ = "*" or character$ = chr$(20) character$ = " "
        without_asterixes$ = without_asterixes$ + character$
    next f
    firstline$ = without_asterixes$
    firstline$ = trim$(firstline$)
    print "Cleaned up the first line:"
    print firstline$
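    postitle = instr(lower$(firstline$), lower$(search1$))
    if postitle = 0 then
        //header not in the expected form: stop rather than write garbage
        print "Could not find \"" + search1$ + "\" in the first line."
        exit
    endif
    fulltitle$ = trim$(mid$(firstline$, postitle + len(search1$)))
    posauthor = instr(lower$(fulltitle$), lower$(search2$))
    if posauthor = 0 then
        //no author separator: treat the whole entry as the title
        title$ = fulltitle$
        author$ = ""
    else
        title$ = trim$(left$(fulltitle$, posauthor - 1))
        author$ = trim$(mid$(fulltitle$, posauthor + len(search2$)))
    endif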
end sub
#2
Yechh, just when you think you have a solution ...

it turns out that Project Gutenberg files are not quite standardized. The first line can read "The Project Gutenberg Ebook of ..." or "Project Gutenberg's Etext of ..." or any combination of the above. Back to the drawing board. But I'll leave this here, someone might be able to use the code for something else.
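
One way to cope with the variations might be to match the header with a pattern rather than a fixed string. Below is a rough sketch in shell (not yab), under a few assumptions: the pg_parse_header function name is hypothetical, the pattern covers only the spellings mentioned above, and it needs a sed that understands -E. Like the yab code, it looks for the header anywhere in the line and splits at the first ", by ".

```shell
#!/bin/sh
# Sketch: accept several spellings of the Project Gutenberg header line.
# Prints "title<TAB>author" and returns 0, or returns 1 if no known form matches.
pg_parse_header() {
    # Remove asterisks and carriage returns, then trim surrounding whitespace.
    line=$(printf '%s\n' "$1" | tr -d '*\r' |
           sed -E -e 's/^[[:space:]]+//' -e 's/[[:space:]]+$//')
    # Anything before the header (such as "The "), an optional "'s",
    # then "EBook" or "Etext" in any letter case.
    prefix="^.*[Pp]roject [Gg]utenberg('s)? [Ee]([Bb]ook|[Tt]ext) of "
    rest=$(printf '%s\n' "$line" | sed -E "s/$prefix//")
    # If nothing was stripped, the header is in a form this sketch does not know.
    [ "$rest" != "$line" ] || return 1
    # Split at the first ", by ".
    title=${rest%%, by *}
    author=${rest#*, by }
    # No ", by " separator means no author could be found.
    [ "$title" != "$rest" ] || author=""
    printf '%s\t%s\n' "$title" "$author"
}

# Example call with a made-up sample line:
# pg_parse_header "Project Gutenberg's Etext of Hamlet, by William Shakespeare"
```

A header in some other form makes the function return 1, so a calling script can skip that file instead of writing wrong attributes.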

Also, I've put a few ebooks on the repo to test the waters, but that's not the way to go. You'd end up with literally thousands of packages called ebookXXXXXXX.hpkg and nobody would be able to find my actual apps. Either set up a separate repo for ebooks or think of something different.