Does anyone on lemmygrad collect magazines besides me?

Yatta's Spouse@lemmygrad.ml · 2 months ago

CriticalResist8@lemmygrad.ml · edit-2 2 months ago

Oh, if you have a PDF you can still OCR it with MinerU. It will be faster in fact, because it can extract the text layer (I did The Wretched of the Earth in 4 minutes flat with it). The problem with PDFs is that if you copy the text outright it will look weird because of how PDF handles text. MinerU is also content-aware, meaning it will remove the headers and footers if there are any, which is why I recommend it. It should also normally preserve tables (very important in some books) and styling such as italics and bold, which a simple copy-paste doesn’t. Basically if you copy PDF text raw it looks like this:

[...]
You can see that reflected
in the products I have designed, which are often noted for their ease of use. The
most powerful things are simple. Thus this book proposes a simple and
straightforward theory of intelligence. I hope you enjoy it.
81
Artificial Intelligence
When I graduated from Cornell in June 1979 with a degree in electrical
engineering, I didn't have any major plans for my life. I started work as an
engineer at the new Intel campus in Portland, Oregon. The microcomputer
industry was just starting, and Intel was at the heart of it. My job was to analyze
and fix problems found by other engineers working in the field with our main
product, single board computers [...]

As you can see, the lines break weirdly, there’s a random page number and chapter reminder in the middle, and it’s missing some bolded text (the book is Jeff Hawkins’s On Intelligence).
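If you only need a quick fix for that kind of paste and don’t want to install anything heavy, a few lines of Python get you part of the way. This is a rough sketch, not what MinerU does internally: the function name is made up, it treats any digits-only line as a page number, and you have to list the running headers yourself because they can’t be guessed safely.

```python
import re


def clean_pdf_paste(raw: str, running_headers: set[str]) -> str:
    """Rejoin hard-wrapped lines and drop page numbers and known running headers."""
    paragraphs = []
    current = []
    for line in raw.splitlines():
        stripped = line.strip()
        # A blank line closes the paragraph being built.
        if not stripped:
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        # Assumption: a line made only of digits is a page number.
        if re.fullmatch(r"\d+", stripped):
            continue
        # Running headers must be supplied by hand (hypothetical parameter).
        if stripped in running_headers:
            continue
        current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Note the limits: it glues the text before and after the dropped header into one paragraph, so a real chapter break is lost, and it can’t bring back bold or italics. That is exactly the kind of thing a content-aware tool handles better.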

LLMs can actually work very well with raw PDF text and clean it up for you, but if the text is really chopped up they might need a cleaner copy to start with. Still, if you want to skip installing a bunch of stuff for MinerU, it’s worth a try. Or like I said, if your party is open to the idea, ask them to send you the raw docx files, which I’m sure they have (they probably import them into InDesign, and if they don’t, they should), and you can just upload those to deepseek and it will take care of the formatting for you.

Otherwise I’m putting the rest down here in a subsection:

Getting minerU to work

If you’re on Windows (which I assume because you say you are not tech-savvy) you will need to install Python 3.13 here: https://www.python.org/downloads/release/python-31312/ (that’s the 3.13.12 release page; scroll down to the bottom for the Windows installers). During installation make sure you have admin permissions and check the “Add python.exe to PATH” checkbox.

Once Python is installed you can install MinerU by opening the cmd and typing pip install mineru[all] (or, if pip isn’t recognized, python -m pip install mineru[all]). It will take some time, but it will install everything MinerU needs on your computer.

Once MinerU is installed, in the same cmd window, run mineru-models-download. Once again it will take some time as it downloads a bunch of models. Expect it to take around 11 gigabytes of disk space in total.

Once everything is installed, you can run OCR with this command, again from the CMD window: mineru -p /path/to/your/document.pdf -o /path/to/output/folder -l [language]. You can do that at any time; you don’t need to reinstall everything we just did each time.
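If you end up with a whole folder of PDFs, typing that command for each one gets old fast. Here’s a minimal sketch of a Python wrapper, assuming mineru is installed and on your PATH. The function names are my own invention; the only flags used are the three from the command above (-p, -o, -l).

```python
import subprocess
from pathlib import Path


def build_mineru_command(pdf: Path, output_dir: Path, language: str) -> list[str]:
    # Same flags as the manual command: -p input file, -o output folder, -l language.
    return ["mineru", "-p", str(pdf), "-o", str(output_dir), "-l", language]


def ocr_folder(pdf_dir: Path, output_root: Path, language: str) -> None:
    """Run MinerU once per PDF, giving each PDF its own output folder."""
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        out = output_root / pdf.stem
        out.mkdir(parents=True, exist_ok=True)
        # check=True stops the loop with an error if MinerU fails on a file.
        subprocess.run(build_mineru_command(pdf, out, language), check=True)
```

You would call it with something like ocr_folder(Path("magazines"), Path("ocr_output"), "[language]"), replacing the folder names and the language with your own.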

If at any point during the installation or while using MinerU something isn’t clear or you get an error in the CMD, just send the entire output to deepseek and it will tell you what to do. Use the expert mode with search on. I myself installed MinerU by copying the commands Deepseek gave me, I didn’t even need to hunt anything down. I then ran into a bug when trying to run it, sent the output to deepseek, and it found the fix in 2 seconds (installing the Python development version). I can’t stress enough how stress-free installing technical software has become.

But after that you can quickly and easily OCR any PDF or image on your computer. Don’t forget to specify a dedicated folder for the output, as MinerU creates a bunch of files, including a markdown file and a JSON. That’s what its OCR output looks like. With pandoc, which is yet another piece of software to install, you can then transform that .md (markdown file) into another format without a hitch. To install pandoc on Windows, download it here https://pandoc.org/installing.html (click ‘get the latest installer’ then look for pandoc-3.9.0.2-windows-x86_64.msi), and then you can use the command pandoc text.md -o conversion.[extension] (add -t [format] before -o if you want to name the output format explicitly).

Pandoc is a really thorough program that can convert text between dozens of formats, such as html, wikitext, markdown, epub, pdf, XML and so on. You can find demos here that showcase some conversions: https://pandoc.org/demos.html. A .docx file is really a zipped bundle of XML files, and both Word and LibreOffice can open it, so if you convert to docx you can open the result in Word afterwards.

If I’m not mistaken, if you give your output file the .docx extension, for example, pandoc should automatically know that it has to convert to docx.
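Here’s a small sketch of how that could be scripted if you want several formats out of the same markdown file. It assumes pandoc is installed and on your PATH, and it leaves the output format to be worked out from the file extension. The function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(markdown_file: Path, extension: str) -> list[str]:
    output = markdown_file.with_suffix("." + extension.lstrip("."))
    # No -t flag here: pandoc infers the output format from the extension.
    return ["pandoc", str(markdown_file), "-o", str(output)]


def convert(markdown_file: Path, extensions: list[str]) -> None:
    """Convert one markdown file into each of the requested formats."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    for extension in extensions:
        subprocess.run(build_pandoc_command(markdown_file, extension), check=True)
```

For example, convert(Path("text.md"), ["docx", "epub"]) should leave text.docx and text.epub next to the original. Some output formats need extra software (pdf in particular), so start with docx.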

So basically MinerU OCRs the PDF into usable text, but in the markdown format. Then with pandoc, you can convert that markdown into a bunch of other formats, if that makes sense. Keep the markdown file MinerU makes (which is the same styling language we use on Lemmygrad btw), and reconvert it with pandoc into anything you need.

I did all of this myself the other day for The Wretched of the Earth and it worked really well! It just needs some manual cleaning up afterwards, but that’s usually only on the chapter titles and because of the PDF files themselves.

MinerU can also run on your CPU (loaded in RAM) if your GPU can’t handle it. You can find the different options by typing ‘mineru --help’, which will tell you how to pick GPU or CPU. I think you need an Nvidia GPU; otherwise use CPU and it should work well too (it just takes longer).

Fine-tuning your translation

I’ve been attempting translation work with LLMs for the past few years and I still haven’t found something I’m 100% happy with, though that 2-pass thing (first pass is the translation, second pass is a new conversation where you ask it to proofread and localize like an editor would) yields better results. This is kind of similar to the ‘critique’ they do in training, where the model being trained generates an output, another model ‘critiques’ it to find problems, and the model in training has to improve to fool the critic. I would try things around this concept, like sending a model both the original text and the LLM translation and asking it to compare, proofread, and fix.
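To make the 2-pass idea concrete, here’s a minimal sketch. The ask parameter is a hypothetical stand-in for whatever sends a prompt to your model and returns its reply; I’m assuming each call is a fresh conversation with no memory of the previous one. The prompt wording is only a starting point.

```python
from typing import Callable

# Hypothetical stand-in: takes a prompt, returns the model's reply.
# Assumption: every call starts a new conversation.
LLM = Callable[[str], str]


def two_pass_translate(source_text: str, source_lang: str, target_lang: str, ask: LLM) -> str:
    # Pass 1: plain translation.
    draft = ask(
        f"Translate the following text from {source_lang} to {target_lang}.\n\n"
        f"{source_text}"
    )
    # Pass 2: a separate call that sees both the original and the draft
    # and acts as a proofreading editor.
    revised = ask(
        f"You are an editor, not an author. Below is a {source_lang} original and "
        f"its {target_lang} translation. Compare them, proofread the translation, "
        "and fix errors and unnatural phrasing without changing the content or the "
        "ideas. Reply with the corrected translation only.\n\n"
        f"ORIGINAL:\n{source_text}\n\nTRANSLATION:\n{draft}"
    )
    return revised
```

Because the model call is passed in as a function, you can try the same two passes with different models just by swapping ask.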

LLMs are not great with all languages because they don’t necessarily train on those languages. So it’ll really depend on the model. You should try a few with the same prompt and input text (just one page of a book is fine, preferably one that is representative of the difficulty of the task). Then once you find a model that seems to handle Hungarian fine, refine the prompt you send along with the text. It might be a very long prompt. You might have to include a glossary of technical terms that need to be translated the same way each time, and you might need to specify a bunch of other stuff like what sort of language register to use etc.

And basically you refine your prompt bit by bit like this until you get something that seems “good enough” for you. I find that it’s important to tell them to “write naturally without changing the content or the ideas - you are an editor, not an author” or something like that.

Once you’re happy with your prompt though you can save it somewhere on your computer to always have it around, and just reuse it each time.
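If you’d rather build the prompt from parts than keep editing one big block of text, something like this could work. It’s only a sketch: the function names, the prompt wording and the glossary format are my own choices for illustration.

```python
from pathlib import Path


def build_translation_prompt(
    source_lang: str, target_lang: str, register: str, glossary: dict[str, str]
) -> str:
    lines = [
        f"Translate the text below from {source_lang} to {target_lang}.",
        f"Use a {register} register.",
        "Write naturally without changing the content or the ideas: "
        "you are an editor, not an author.",
    ]
    if glossary:
        # Fixed translations keep technical terms consistent across pages.
        lines.append("Always translate these terms the same way:")
        for term, translation in sorted(glossary.items()):
            lines.append(f"- {term} -> {translation}")
    return "\n".join(lines)


def save_prompt(prompt: str, path: Path) -> None:
    # Keep the finished prompt on disk so it can be reused each time.
    path.write_text(prompt, encoding="utf-8")
```

The glossary is just a dictionary of source term to preferred translation, so you can grow it as you notice terms the model keeps translating inconsistently.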

As for language pairs, yes, it could probably work both ways. I.e. if you can get Hungarian translated to English in good quality (by an LLM), you can probably get the LLM to also translate English to Hungarian. Older methods and humans are more finicky lol, but in my opinion LLMs should have no problem with language pairs as long as they know one of the two languages sufficiently.

Agentic pipeline

The pipeline Python scripts I was talking about are, you guessed it, more LLM stuff. Join us over on [email protected] to learn how to start using agentic tools on your computer. But basically put $5 in the deepseek API, install crush, and then have the agent code you a bundle of scripts to automate most of the process for you. It’s what I did to get ProleWiki translated to French: it’s a collection of 4 different Python scripts, all LLM-coded, to 1. download our pages, 2. translate them intelligently with an LLM (with progress tracking, cutting up big files into chunks etc), 3. clean up the translation artefacts due to the model and 4. upload the translated pages to ProleWiki.
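To give an idea of what “cutting up big files into chunks” and “progress tracking” can mean in practice, here’s a rough sketch. This is not the actual ProleWiki code, just one simple way to do it. The translate function is a hypothetical stand-in for the LLM call, and the progress file is a plain JSON list of the chunks translated so far.

```python
import json
from pathlib import Path
from typing import Callable


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into chunks of at most max_chars where possible."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        # A single paragraph longer than max_chars is kept whole rather than cut.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def translate_with_progress(
    chunks: list[str], translate: Callable[[str], str], progress_file: Path
) -> list[str]:
    """Translate chunks in order, saving after each one so a crash loses little."""
    done = []
    if progress_file.exists():
        done = json.loads(progress_file.read_text(encoding="utf-8"))
    # Skip the chunks that were already translated on an earlier run.
    for chunk in chunks[len(done):]:
        done.append(translate(chunk))
        progress_file.write_text(json.dumps(done, ensure_ascii=False), encoding="utf-8")
    return done
```

Splitting on paragraph boundaries keeps sentences intact, which matters because the model only sees one chunk at a time. If a run dies halfway through, running it again picks up from the saved list instead of paying for the same chunks twice.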

You don’t need to know computers or how to code anymore to have this kind of stuff and I think that’s pretty cool. It definitely helps but for something simple like that you don’t need to be too technical. The code might get complex, but you let the AI handle it. You’re the client for the script, you don’t need to know how it works, just that it does.

It’s more involved but then you could have a mostly automated pipeline that runs minerU on the pdfs, sends them to an LLM API to get them translated (like I have), then labels and saves the translations or something. That way instead of doing every step yourself you just run the script.
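As a sketch of what that glue script might look like: the two steps are passed in as functions, so ocr could wrap a MinerU run and return the markdown it produced, and translate could wrap an LLM API call. Both are hypothetical placeholders here, and the file naming scheme is just one way to label the results.

```python
from pathlib import Path
from typing import Callable


def run_pipeline(
    pdf_dir: Path,
    out_dir: Path,
    target_lang: str,
    ocr: Callable[[Path], str],
    translate: Callable[[str], str],
) -> list[Path]:
    """OCR and translate every PDF in pdf_dir, skipping ones already done."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Label each result with the source name and the target language.
        target = out_dir / f"{pdf.stem}.{target_lang}.md"
        if target.exists():
            continue  # already translated on an earlier run
        markdown = ocr(pdf)
        target.write_text(translate(markdown), encoding="utf-8")
        written.append(target)
    return written
```

The skip check means you can drop new PDFs into the folder and rerun the script without redoing the old ones.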

But if you have PDFs you could probably just feed them manually to Deepseek tbh, by just uploading them in the chatbox.