TIFF → DjVu conversion produces bigger file from bilevel doc than color (sopuli.xyz)

submitted 1 year ago* (last edited 1 year ago) by freedomPusher@sopuli.xyz to c/techsupport@sopuli.xyz


cross-posted from: https://sopuli.xyz/post/8936481

I would like to find out what I am doing wrong that makes black and white documents come out with a bigger file size than color ones.

My process for a color TIFF is like this:

① tiff2pdf ② ocrmypdf ③ pdf2djvu

The resulting color DjVu file is ~56k. Running pdfimages -all on the intermediate PDF file shows that a CCITT (fax) image is inside.
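The three steps above can be sketched as a small Python driver. This is only an illustration under stated assumptions: the file names (`scan.tiff`, `*.raw.pdf`, `*.ocr.pdf`) and the function names are hypothetical, and the exact options used in the original runs are not shown in the post, so only the basic input/output arguments of each tool appear here. By default it prints the commands without running them.

```python
import shlex
import subprocess
from pathlib import Path


def build_pipeline(tiff, workdir="."):
    """Return the three commands of the TIFF -> PDF -> OCR'd PDF -> DjVu chain."""
    tiff = Path(tiff)
    work = Path(workdir)
    raw_pdf = work / f"{tiff.stem}.raw.pdf"  # hypothetical intermediate name
    ocr_pdf = work / f"{tiff.stem}.ocr.pdf"  # hypothetical intermediate name
    djvu = work / f"{tiff.stem}.djvu"
    return [
        ["tiff2pdf", "-o", str(raw_pdf), str(tiff)],   # step 1: wrap the TIFF in a PDF
        ["ocrmypdf", str(raw_pdf), str(ocr_pdf)],      # step 2: add a text layer
        ["pdf2djvu", "-o", str(djvu), str(ocr_pdf)],   # step 3: convert to DjVu
    ]


def run_pipeline(tiff, workdir=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_pipeline(tiff, workdir):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: prints the commands only, so it works without the tools installed.
run_pipeline("scan.tiff")
```

Keeping both intermediate PDFs makes it possible to inspect each stage separately and see where the image encoding changes.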

My process for a black and white TIFF is the same:

① tiff2pdf ② ocrmypdf ③ pdf2djvu

The resulting black and white DjVu file is ~145k (almost 3× the color size). Running pdfimages -all on the intermediate PDF file shows that a PNG file is inside. If I replace step ① with ImageMagick’s convert, the first PDF is 10 MB, but the final DjVu file is still ~145k, and a PNG is still inside the intermediate PDF.
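One way to see which stage changes the encoding is to list the images in each intermediate PDF instead of extracting them. The sketch below swaps in `pdfimages -list` (the post used `pdfimages -all`) and parses its table. It assumes the usual poppler column layout (`page num type width height color comp bpc enc ...`); the function names are hypothetical. An `enc` value of `image` is, as far as I know, what `-all` writes out as PNG, while `ccitt` and `jbig2` are the compact bilevel encodings.

```python
import subprocess


def image_encodings(listing):
    """Extract page, color space, bits per component and encoding
    from the text printed by `pdfimages -list`."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and the dashed rule do not.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "color": fields[5],
            "bpc": int(fields[7]),
            "enc": fields[8],
        })
    return rows


def list_images(pdf_path):
    """Run `pdfimages -list` on a PDF (requires poppler-utils) and parse the result."""
    out = subprocess.run(
        ["pdfimages", "-list", str(pdf_path)],
        check=True, capture_output=True, text=True,
    ).stdout
    return image_encodings(out)
```

Calling `list_images()` on the PDF after step ① and again after step ② would show whether the bilevel image already arrives as `image` or only becomes one after OCR.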

I can get the bitonal (bilevel) image smaller by using cjb2 -clean, which goes straight from TIFF to DjVu, but then I can’t OCR it because there is no intermediate PDF. And the result (~68k) is still bigger than the color doc.
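For comparing the direct route against the PDF route, a small helper along these lines may be handy. It is a sketch with hypothetical function and file names: it builds the `cjb2 -clean` command mentioned above and reports the sizes of whichever output files exist.

```python
import os


def cjb2_command(tiff, djvu, clean=True):
    """Build the direct TIFF -> DjVu command; -clean removes small flyspecks."""
    cmd = ["cjb2"]
    if clean:
        cmd.append("-clean")
    return cmd + [str(tiff), str(djvu)]


def size_report(paths):
    """Map each existing path to its size in KiB; missing files are skipped."""
    return {str(p): os.path.getsize(p) / 1024 for p in paths if os.path.exists(p)}


# Example (file names are placeholders, nothing is executed here):
print(" ".join(cjb2_command("scan-bw.tiff", "scan-bw.direct.djvu")))
print(size_report(["scan-bw.direct.djvu", "scan-bw.djvu", "scan-color.djvu"]))
```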

#askFedi

update

I think I found the problem, which would not have been evident from what I posted. I was passing the --force-ocr option to ocrmypdf, just to push through errors like “this doc is already OCRd”. But that option does much more than you would expect: it transcodes the doc (every page is rasterized and the image is re-encoded). It looks like my fix is to pass --redo-ocr instead. It’s not yet obvious to me why --force-ocr affected bilevel images more.
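The difference between the two options can be made explicit when building the ocrmypdf command. This is a minimal sketch: the mode names and the function are hypothetical, and `--skip-text` does not appear in the post (it is another ocrmypdf option for pages that already have text, included here only for completeness). Only one of these options can be given per run.

```python
# Mode names are made up for this sketch; the flags are ocrmypdf options.
OCR_MODES = {
    "default": [],             # fails on pages that already contain text
    "skip": ["--skip-text"],   # leave pages with existing text untouched
    "redo": ["--redo-ocr"],    # replace the old OCR layer, keep the page image
    "force": ["--force-ocr"],  # rasterize every page, then OCR (re-encodes images)
}


def ocrmypdf_command(src, dst, mode="redo"):
    """Build an ocrmypdf command line for the chosen handling of existing text."""
    if mode not in OCR_MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {sorted(OCR_MODES)}")
    return ["ocrmypdf", *OCR_MODES[mode], str(src), str(dst)]


print(" ".join(ocrmypdf_command("scan.raw.pdf", "scan.ocr.pdf", mode="redo")))
```

Running the pipeline once with `mode="force"` and once with `mode="redo"` on the same bilevel input, then comparing the image encodings and file sizes, would confirm whether the rasterization step is what inflates the result.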
