Wednesday, June 14, 2017

How to read A4-format PDF and DJVU files on 6” e-readers and smartphones


Anyone who has used a six-inch e-reader probably knows that reading A4-format PDF books on such a small screen is not very comfortable: to make the text larger, you have to rotate the page by 90° and read it half a page at a time, because the text does not adapt to the screen size. Even then the text is still rather small.
Recently, though, I found a program called k2pdfopt, which does a fairly high-quality "reflow" of a given PDF or DJVU file and saves the result as a new PDF file in the required format (by default, for a 6” e-reader). The easiest way to explain it is to show the source text and the resulting text side by side:
[Two screenshots: the original page and the converted page]

In addition:
  1. the program understands two-column layouts in the source file;
  2. it can generate files for landscape viewing;
  3. it supports various resolutions and DPI values (so you can also reformat for 3.7” smartphones);
  4. colors in color documents can be kept or converted to black and white;
  5. there are versions for Windows, Linux, and Mac OS X;
  6. (bonus!) there is a special version for the Kindle 3 (written by a different author) that performs the conversion right on the Kindle itself!
The program runs from the command line, and there is also a separate GUI for it, K2PDFOPT Windows GUI, written by a different author.
Getting an excellent result takes a bit of fiddling, but you only need to do it once and can then reuse the parameters you settled on:
  1. To be able to search the text in the resulting PDF and look up words in a dictionary by pointing at them, the program has to embed the text in the PDF. To do this, install the Tesseract text recognition system (details here), select the languages you need during its installation, and then enable OCR (text recognition) in the parameters (the built-in GOCR engine gives poor results). Conversion is then several times slower, but there is no way around it: OCR is not a fast operation.
  2. To get good-looking fonts with smooth letters, raise the output resolution; I use -dr 2, which doubles it.
  3. If the text in the source file is always single-column, it is better to set -col 1.
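The three settings in the list above can be collected into a single command line. Here is a minimal Python sketch that only builds the argument list; it is not part of k2pdfopt. The helper name `build_reflow_command` and the file name `book.pdf` are made up for illustration, and the sketch assumes that `k2pdfopt` is on the PATH and that plain `-ocr` picks the OCR engine you want (the engine choice is left at the program's default here).

```python
def build_reflow_command(pdf_path, dr=2.0, single_column=True, ocr=True):
    """Build an argv list for k2pdfopt (hypothetical helper, runs nothing)."""
    cmd = ["k2pdfopt"]
    if ocr:
        cmd.append("-ocr")            # embed recognized text in the output
    cmd += ["-dr", f"{dr:g}"]         # resolution factor, e.g. 2 or 1.5
    if single_column:
        cmd += ["-col", "1"]          # source is known to be single-column
    cmd.append(pdf_path)
    return cmd


print(" ".join(build_reflow_command("book.pdf")))
```

To actually run the conversion, the list can be passed to `subprocess.run(...)`. The program may still show its text menu and wait for [Enter] before it starts.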
If you are not happy with the conversion result, I strongly recommend reading the K2PDFopt FAQ, which explains which parameters to set when problems arise.
The easiest way to start a conversion is to drag the PDF file onto the program's icon. A console window then opens, where you can set parameters in a text menu (yes, just like in the 90s). Most of the time, though, it is enough to press [Enter], or to enter the page numbers to convert and start the conversion, so the interface is not really needed.

3 comments:

  1. FREQUENTLY ASKED QUESTIONS (last updated 28 Dec 2016)

    Why does the Mac OS X version not run on my Mac?
    If you are running OS X 10.5.x or earlier, k2pdfopt may not run on your system. See the first paragraph in my Mac install notes.

    How do I increase the text size?
    See the help page on increasing the magnification.

    The output file size (in bytes) is large. Can I make it smaller?
    With the default conversion, which allows text re-flow, every converted page is a bitmap, so the file size of the converted file is often larger than the original; however, many e-readers can process PDF files made up of bitmaps faster and with less memory overhead than the original PDF file, so you might still prefer this type of conversion. If you still want a smaller output file size, see my help page on output file size for options that reduce the output file size, mostly at the expense of the output quality. If you don't need text re-flow, you might try using a mode which converts using native PDF output.

    I just want k2pdfopt to remove the excess borders on my PDF file. Can it do that?
    Absolutely. As of v1.60, the shorthand option for this is -mode fw (fw = fit width), which is equivalent to -n -wrap- -col 1 -vb -2 -t -ls. If you still want to rasterize the output, use -mode fw -n-. If you don't want to turn the document on its side, use -mode fw -ls-. You can select the mode from the user menu by typing "mo" at the prompt. Here are some examples of other k2pdfopt modes.
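The fit-width variants from this answer can be written down as a small sketch. The Python below is only an illustration: the helper name `fit_width_args` and its parameters are hypothetical, and the flags are taken from the answer above, not checked against a particular k2pdfopt version.

```python
# What the FAQ says "-mode fw" is shorthand for.
FW_EQUIVALENT = "-n -wrap- -col 1 -vb -2 -t -ls".split()


def fit_width_args(rasterize=False, keep_portrait=False):
    """Return k2pdfopt options for border trimming (hypothetical helper)."""
    args = ["-mode", "fw"]
    if rasterize:
        args.append("-n-")    # switch native output off again
    if keep_portrait:
        args.append("-ls-")   # do not turn the document on its side
    return args


print(fit_width_args())
print(fit_width_args(rasterize=True, keep_portrait=True))
```

The overrides are appended after `-mode fw` so that they apply on top of the defaults that the mode sets.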

    I want to use k2pdfopt like Briss--select a crop region for the document and put only that region in the output PDF.
    This is easy when you use the MS-Windows GUI. Make one of the "Crop Areas" active (check box); type in the applicable page range for the crop box (e.g. 2-99), then click the blue Select button and choose your crop region. For the conversion mode, select Crop (command-line: -mode crop).

    I just want k2pdfopt to OCR my document. Can it do that?
    Yes. Try using -mode copy -ocr as command-line options. See the OCR help page.


  2. How do I extract the text from a PDF file to an ASCII/UTF-8 file?
    k2pdfopt -ocrout textout.txt -mode copy myfile.pdf
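For several files, the same command can be generated per file. This is a sketch in Python under a few assumptions: the helper name `ocr_text_command` is made up, the text file is simply named after the PDF, and the commands are only printed, not executed.

```python
from pathlib import Path


def ocr_text_command(pdf_path):
    """Build the text-extraction command for one PDF (hypothetical helper)."""
    text_path = Path(pdf_path).with_suffix(".txt")   # myfile.pdf -> myfile.txt
    return ["k2pdfopt", "-ocrout", str(text_path), "-mode", "copy", str(pdf_path)]


for name in ["myfile.pdf", "notes.pdf"]:
    print(" ".join(ocr_text_command(name)))
```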

    The output file has poor resolution on my device. Can I improve it?
    Definitely. The default k2pdfopt settings are for a Kindle 2, and your device may have better or slightly different resolution. You can change the device by using -dev (interactive menu option "d"). Or see my page on setting k2pdfopt for any custom device resolution. You can also just use, for example, -dr 2 (new option in v1.60), which increases the display resolution by a factor of two. The drawback is that your converted files will be significantly larger and may take longer to render on your device, so you may want to experiment to find the right value (you can use fractions, e.g. -dr 1.5). You can type this option directly into the user menu prompt, e.g. "-dr 2" (without the double quotes).
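A rough way to see why larger -dr values make files grow: every converted page is a bitmap, and scaling both dimensions by a factor multiplies the pixel count by roughly the square of that factor. The Python sketch below assumes a base page of 560 x 735 pixels, which is an illustrative guess at the Kindle 2 default and not a value from this FAQ. Real file sizes also depend on compression, so treat the ratio as an estimate only.

```python
BASE_WIDTH, BASE_HEIGHT = 560, 735   # assumed base page size in pixels


def page_pixels(dr):
    """Pixels per output page when both dimensions are scaled by dr."""
    return round(BASE_WIDTH * dr) * round(BASE_HEIGHT * dr)


for dr in (1, 1.5, 2):
    ratio = page_pixels(dr) / page_pixels(1)
    print(f"-dr {dr:g}: about {ratio:.2f}x the pixels of the default")
```

Under these assumptions, -dr 2 gives four times as many pixels per page, which is why an intermediate value such as -dr 1.5 can be a reasonable compromise.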

    When I convert with native PDF output, my Kindle has problems reading the output file (runs out of memory / very slow / crashes). Why?
    There are likely too many cropped-and-scaled regions in your output file. Try using a specific conversion mode instead. Modes are shorthand for setting a collection of options that are best suited for a specific type of optimization. See the native PDF help page. Another option is to use a mode like -mode 2col or -mode fitwidth which defaults to native output and then turn off the native output by unchecking the "Native Output" box or specifying -n- (after the -mode ... option) on the command line. The output will look the same, and it will still have searchable and highlightable text, but it will be bitmapped. For some devices and/or documents, bitmaps are faster and require less memory overhead to render. If the bitmapped text is too grainy, you can use -dr to improve it, e.g. -dr 2 will double the resolution of the output bitmaps.

    I'm having trouble selecting text in the converted PDF file with native PDF output.
    If there is more than one cropped/scaled region on an output page, most PDF reading applications will get confused and allow selection of "invisible" text which is outside cropped regions and which overlaps with displayed text. As of k2pdfopt version 2.31, if you have Ghostscript installed, you can use -ppgs to post-process your PDF with Ghostscript (in the MS Windows GUI, check the "Post-process w/Ghostcript" box), which very nicely eliminates this issue (thank you to Andrea Lazzarotto for suggesting this modification). You can also use -bp m (in the MS Windows GUI, check the "Avoid Text Select Overlap" box) to force only one cropped region per output page, but this may result in a lot of blank space in your converted file. Or, like in the previous answer, you can use a mode like -mode 2col or -mode fitwidth which defaults to native output and then turn off the native output by unchecking the "Native Output" box or specifying -n- (after the -mode ... option) on the command line. The output will look the same, and it will still have searchable and highlightable text, but it will be bitmapped and will not have any overlapping "invisible" text outside the crop regions. For some devices and/or documents, bitmaps are faster and require less overhead to render. If the bitmapped text is too grainy, you can use -dr to improve it, e.g. -dr 2 will double the resolution of the output bitmaps.
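The workarounds from this answer and the previous one can be sketched as two small option builders. The Python below is illustrative only: the function names are hypothetical, and the one detail it tries to capture is the ordering requirement that -n- comes after the -mode option.

```python
def bitmap_from_mode(mode, dr=None):
    """Options for a native-output mode with native output switched off."""
    args = ["-mode", mode, "-n-"]      # -n- must follow -mode, per the FAQ
    if dr is not None:
        args += ["-dr", f"{dr:g}"]     # sharpen the bitmaps if they look grainy
    return args


def native_fix_args(use_ghostscript):
    """Pick one of the two native-output fixes described above."""
    # -ppgs needs Ghostscript installed; -bp m may leave blank space instead.
    return ["-ppgs"] if use_ghostscript else ["-bp", "m"]


print(bitmap_from_mode("2col", dr=2))
print(native_fix_args(use_ghostscript=False))
```

Whether Ghostscript is available is passed in explicitly, because the executable name differs between systems and the sketch does not try to detect it.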

    I don't understand how k2pdfopt is interpreting my PDF file.
    Try using the -sm command-line option ("sm" from the interactive menu), which will write out a PDF file that shows the regions found by k2pdfopt.

    I want to use text re-flow, but my tables / equations / figures get mangled.
    Try protecting those regions by drawing boxes around them.



  3. Is there a k2pdfopt GUI (graphical user interface)?
    There is now an integrated MS-Windows GUI (as of v2.x), and there are also a number of user-contributed front ends for k2pdfopt.

    Can k2pdfopt run directly on my kindle?
    Yes. See the information on my third-party contributions page.

    How do I prevent images / figures from being split across pages?
    Use -f2p -1, or select "bp" from the interactive menu and enter -1 for the "fit-to-page" value.

    The columns in my document are not detected correctly.
    See the column detection help page.

    Some of the text is much larger than the rest. How can I avoid that?
    If your document does not have multiple columns, try turning off multiple column detection with command-line option -col 1 (interactive menu option "co"). See the page on column detection and also the page on showing markings so that you can see how k2pdfopt is converting your document.

    How can I get rid of the document headers, footers, page numbers and/or other marks near the edges of the source pages?
    You can tell k2pdfopt to ignore an arbitrarily sized border around your document. See Ignoring Borders/Headers/Footers.

    Sometimes I get multiple rows of text at smaller magnification than the rest of the document. Why?
    This generally happens when there is not a clear gap between rows of text and k2pdfopt thus views the region as a graphical block (figure) rather than as rows of text. If you haven't updated to v1.65, you should do so. K2pdfopt v1.65 is smarter about breaking rows--if it detects a double- or triple-height text row amidst other single-spaced text rows, it will usually fix this. See this post (and the response) for tips on how to adjust your k2pdfopt settings.

    Is there any way to search / highlight the text in the converted PDF file?
    Yes, as of v1.50, k2pdfopt has OCR capability, and as of v1.60, k2pdfopt has options for native PDF output, much like Cut2Col, SoPDF, and the latest version of PaperCrop. In fact, as of v2.00, if the text in your source PDF document can be searched or highlighted (e.g. if it is computer generated or scanned with an OCR layer), the default output from k2pdfopt should have the same functionality using the new virtual OCR feature (see "major new features" under v2.00 in the version history for more details). In these cases (when the source PDF is computer generated or has an OCR layer), it is not necessary to use the Tesseract OCR engine in k2pdfopt. Note that PDF highlighting is not possible on some e-readers, such as early Kindles (Kindle 1/2).

    My PDF file has a lot of pages. Can I convert only certain pages?
    Yes. In the Windows GUI, look for the "Pages to Convert" box and enter a page range, e.g. 1-100. Or see the Windows Getting Started page and scroll down to "2. Enter Page Range" (or use the -p command-line option).
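When the pages you need are scattered, it can help to generate the page list instead of typing it. The Python sketch below collapses page numbers into a compact string. The helper `page_list` is made up for illustration, and the comma-separated form is an assumption about what -p accepts; the answer above only shows a simple range such as 1-100.

```python
def page_list(pages):
    """Collapse page numbers into a compact range string without spaces."""
    nums = sorted(set(pages))
    parts = []
    i = 0
    while i < len(nums):
        start = end = nums[i]
        # Extend the current run while the next page is consecutive.
        while i + 1 < len(nums) and nums[i + 1] == end + 1:
            i += 1
            end = nums[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


print(["k2pdfopt", "-p", page_list(range(1, 101)), "book.pdf"])
```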

    Some of the text is truncated / clipped. Can I fix that?
    In versions before v1.65, k2pdfopt ignores (crops) a 0.25-inch border around your document by default. Turn this off by using command-line option -m 0 (which is no longer necessary in v1.65--the default is now -m 0). See Ignoring Borders/Headers/Footers.
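The version rule in this answer is easy to encode if you script conversions for several installations. A minimal Python sketch, assuming version strings written like those in this FAQ (for example "v1.60" or "2.31", with a two-digit minor part); the helper name `margin_args` is hypothetical.

```python
def margin_args(version):
    """Return ["-m", "0"] only for k2pdfopt versions older than 1.65."""
    major, minor = (int(part) for part in version.lstrip("v").split(".")[:2])
    # Before v1.65 a 0.25-inch border was cropped by default.
    return ["-m", "0"] if (major, minor) < (1, 65) else []


for version in ("v1.60", "v1.65", "2.31"):
    print(version, margin_args(version))
```

Since the answer describes -m 0 as merely unnecessary on newer versions, always passing it is probably the simpler choice; the check is shown only to make the rule explicit.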
