• 0 Posts
  • 36 Comments
Joined 1 year ago
Cake day: June 6th, 2025

  • Enthusiasts will care because it could save them storage space for equivalent quality

    That’s the thing: even if you ignore that such scenarios involve lossy-to-lossy re-encodes (bad), and even if you ignore the general lack of psychoacoustic tuning in new encoders, the advertised so-called objective “20%-30% improvement” is not universal. It only applies to bit-starved, resolution-maxed encodes.

    Your file is 1080p or 720p? You won’t get that improvement, even in not-fit-for-purpose “objective” measures.

    You want to encode at a higher bitrate than YouTube to actually get good quality? You won’t get that improvement either.

    So if you embark on such a futile journey, you could be wasting a lot of computing power for no, or even negative, gain.



  • So I gave the actual code a one minute look (literally).

    Picked src/radicle/util.c, since that was the last file touched.

    The level of defensive programming doesn’t look that good (and I’m trying to be nice here).

    Here is an example, and note that I haven’t done C in a while:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    
    void rad_rstrip_nl(char* str) {
        int len_str = strlen(str);
        if (str[len_str-1]=='\n') {
          str[len_str-1] = 0;
        }
    }
    
    bool rad_get_input (char* str, size_t bufsiz) {
        if (!fgets(str,bufsiz,stdin)) return false;
        rad_rstrip_nl(str);
        return true;
    }
    
    int main() {
      char a[] = {0,0,0,0};
      bool i = rad_get_input(a, 4);
      printf("%zu\n", strlen(a));
      rad_rstrip_nl(a);
      return i;
    }
    

    The two functions above main() are copy-pasted from that file.

    Let’s zoom in:

    int len_str = strlen(str);
    if (str[len_str-1]=='\n') {
    

    Here we’re accessing str[len_str-1] without checking len_str first.

    But you might be thinking, maybe len_str can’t be zero!

    Let’s compile first with the AddressSanitizer enabled:

    # compile
    % gcc -Wall -fsanitize=address t.c -o t
    

    Now let’s see how easily we can have fun:

     % echo -n '\0' | ./t
    =================================================================
    ==2949689==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7ba827af001f at pc 0x56032434d259 bp 0x7fff1d199010 sp 0x7fff1d199000
    READ of size 1 at 0x7ba827af001f thread T0
        #0 0x56032434d258 in rad_rstrip_nl (/tmp/t+0x1258) (BuildId: 1ee68e4d67960002de80ae290c8811c63f94aa51)
        #1 0x56032434d311 in rad_get_input (/tmp/t+0x1311) (BuildId: 1ee68e4d67960002de80ae290c8811c63f94aa51)
        #2 0x56032434d3e4 in main (/tmp/t+0x13e4) (BuildId: 1ee68e4d67960002de80ae290c8811c63f94aa51)
        #3 0x7fa82a227740  (/usr/lib/libc.so.6+0x27740) (BuildId: 020d6f7c33b2413f4fe10814c4729dce1387f049)
        #4 0x7fa82a227878 in __libc_start_main (/usr/lib/libc.so.6+0x27878) (BuildId: 020d6f7c33b2413f4fe10814c4729dce1387f049)
        #5 0x56032434d124 in _start (/tmp/t+0x1124) (BuildId: 1ee68e4d67960002de80ae290c8811c63f94aa51)
    

    (The rest of AddressSanitizer output omitted.)
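
    For what it’s worth, a more defensive version could look like the minimal sketch below. The name rad_rstrip_nl_checked and the returned length are made up for illustration and are not from that repo; the only real change is checking the length before indexing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hardened variant of rad_rstrip_nl (not from the repo).
 * Strips at most one trailing '\n' and returns the resulting length.
 * NULL and empty strings are left untouched. */
size_t rad_rstrip_nl_checked(char *str) {
    if (str == NULL) {
        return 0;
    }
    size_t len = strlen(str);               /* size_t, not int */
    if (len > 0 && str[len - 1] == '\n') {  /* check len before indexing */
        str[--len] = '\0';
    }
    return len;
}
```

    With this check, the empty string produced by the NUL-byte input above is simply returned as-is instead of reading str[-1].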

    Another function from the same file:

    char* rad_strcpy (char* out, const char* inp, int from, int len) {
        const char* inp_shifted = inp+from;
        int len_inp_shifted = strlen(inp_shifted);
        if (len <= len_inp_shifted) {
    	memcpy(out,inp,len);
    	out[len] = 0;
        }
        else {
    	memcpy(out,inp,len_inp_shifted);
    	out[len_inp_shifted] = 0;
        }
        return out;
    }
    

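    Here, inp is shifted by from before the length of inp is checked, so a from past the end of the string would make strlen() start reading out of bounds. That doesn’t look safe. But my one minute is up, so I didn’t dive into the function callers.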
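
    A checked variant might look like the sketch below. Everything about it is an assumption on my part: the name rad_substr_checked, the extra out_size parameter, and the bool return are hypothetical. It also copies from the shifted pointer, which is my guess at the intent; the quoted function copies from inp.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical checked variant (not from the repo): copies up to len bytes
 * of inp, starting at offset from, into out. out_size is the capacity of
 * out including the terminator. Returns false and leaves out untouched
 * when the arguments are unusable. */
bool rad_substr_checked(char *out, size_t out_size,
                        const char *inp, size_t from, size_t len) {
    if (out == NULL || inp == NULL || out_size == 0) {
        return false;
    }
    size_t inp_len = strlen(inp);
    if (from > inp_len) {               /* validate the offset before shifting */
        return false;
    }
    size_t avail = inp_len - from;      /* bytes available after the offset */
    size_t n = len < avail ? len : avail;
    if (n >= out_size) {                /* no room for n bytes plus '\0' */
        return false;
    }
    memcpy(out, inp + from, n);
    out[n] = '\0';
    return true;
}
```

    Passing the output capacity explicitly is a design choice of this sketch; without it, the callee has no way to know whether out can hold the result.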


    Pretending C is a good choice in 2026, then not being extra vigilant with defensive programming, is not a good look. I remember being more vigilant in my wrappers even when I was a beginner.

    This is made worse by the developer repeating literal memes like:

    One issue I have with rust is that it adds another layer of trusting the compiler isn’t backdoored. All UNIX/Linux systems use the gcc toolchain

    Maybe such an enlightened developer should know that you can bootstrap rustc from mrustc using GCC.


  • Most tor peers are not relays. So no, tor’s network capacity doesn’t auto-scale with more users, even when you’re sticking to hidden services.

    And you didn’t argue for anonymity from the start. And anonymity is a BIG argument, with bigger design implications than you think.

    Original Freenet (now called Hyphanet) predates both bittorrent and tor. And it’s one early example (and not the only one btw) of how you properly combine anonymous storage with anonymous transport (content addressing too, but that’s more of a jibe against the IPFS meme). It’s also (relatively) slow, and that’s actually intentional, at least in part, because speed can hurt your anonymity (the details are too technical, and that’s not the place to delve into them).

    Bittorrent didn’t lack (native) anonymity because the idea/tech was impossible to imagine. Anonymity didn’t come into the picture because availability and speed were the priorities. The protocol didn’t have encryption from the start either (or sub-piece downloading, or DHT, or PEX, or udp trackers, or uTP transport, but I digress).



  • How do you send a threat to an IP address 😏, about a supposed push of code that is encrypted, no less? Unless you’re thinking of ISP involvement, in which case that would be a hilarious (single) e-mail to read (from the “lawyer” to the ISP, because there will be no other correspondence).

    If the threat model is “lawyer”, developers will be fine. If it’s a “state actor” and/or all users need protection, then again, this is a whole other conversation. If it’s something in between, then yes, maybe developers/publishers should specifically be careful, and/or maybe the design of the software should help them, but without compromising the performance of the whole network. But again, bittorrent will not be the right protocol for this anyway.


  • P2P already gives you anti-censorship. Publisher anonymity shouldn’t be hard for developers to achieve either.

    If full network anonymity is desirable, then you would need a full top-down design to do it properly, and this becomes a whole different conversation (and the choice of using bittorrent itself will have to be revisited). But really, I don’t think that’s needed here.

    Pluggable transports can still be useful for varying reasons of course. I don’t think anyone would argue against them.

    But I would still opine that forcing a slow network on any forge alternative is the fastest way to keep 99% of potential users away.







  • How many layers should I go through?

    Here are a few:

    • PRs replaced patch sets. Patch sets have nothing to do with “strangers”. Both are the medium where review for a logical grouping of code changes takes place. There are no separate categories here.
    • In most open-source projects, everyone involved is a “stranger” to others anyway, including co-developers if any.
    • PRs/patch sets are orthogonal to T*D/Trunk-Based/Team-focused development. How this can be missed is hard to imagine. I would have assumed everyone is aware of draft/wip/rfc PRs, or dev/trunk branches. And tests need development alongside functional modifications anyway.

  • Your comment contains an implicit assumption: that interest in a project only ever arises while the project is under active development.

    A person could grow a newfound interest in a repo after 1/3/5/10/20 years of inactivity. Most people are not glued to their chairs watching endless feeds, and bookmarking/starring (and maybe forking) all repos of interest away. The “normal” chain of events usually starts with a person growing a need for certain functionality (for research or direct use), and then checking out all tools, libraries, or resources available related to that functionality.

    Relying on users to only “seed” repos they approve of is not a good strategy for high availability, for many reasons, not the least of which is the tendency of some users to throw a tantrum at some point and press the “remove account and delete all history” button*. This is why anonymous distributed storage is unrivaled as an availability provider, at least for a period. Long-term availability, however, still requires frequent re-grabbing or re-insertion (both have the same “refreshing” effect in these networks).

    *Pushing code repos themselves to the side again, a decision will also have to be made with regard to whether the “ghost” behavior from GitHub should be replicated, or whether “respecting the user’s wishes” to really delete EVERYTHING should take precedence. Deciding this is important, as it would/should be a part of the user agreement.



  • Forgejo/Codeberg is the one that will take over in the coming decade.

    This is wishful thinking, and if it did happen, it would reintroduce the same problem anyway: centralization (the Codeberg part).

    I don’t take seriously individuals celebrating a move to self-hosting either. While it may look cool and ideally liberating at first, infrastructure/hosting responsibility has worse bus factors and burnout than actual development (not to mention actual monetary costs). It’s safe to assume that any code self-hosted has a high chance of becoming unreachable in 1-3 years (and yes, exceptions exist).

    Solutions like Radicle don’t help with unpopular repos, as you would again get a (hosting) bus factor of 1 (the dev/seeder), if that.

    A theoretical solution leveraging an anonymous encrypted distributed storage network for repos would help keep code alive for a while (after the bus hits). But unpopular content will eventually fizzle out of the network.

    Multiple congregations of Forgejo (or something similar) communities forming would be cool. But the technology that would help them form one social block with network effects doesn’t exist*. And what’s proposed here and there (like federation for issues) doesn’t cover the code itself. And even if we get far in that direction, instance drama incidents, and attempts at exerting control over “the network” will inescapably appear.

    * I don’t know if tangled counts. But judging by the amount of love 😑 people show the AT protocol, it may as well not exist.


    tl;dr: Codeberg will not become GH-big. And if it did, it wouldn’t be a good thing. And yet there is no ideal alternative to central forges anyway, not even a theoretical one.


  • LRU inversion: the problem with not caring about it is that it’s not a visible problem until it very suddenly is. Your system will not gradually degrade but very suddenly and unpredictably hit a wall that it cannot get itself over.

    All this talk just confirms my feelings that there is a general lack of understanding of actual modern workloads.

    RAM (normal w/wo zram) doesn’t get full, then stay full forever in real workloads. Not only is that not realistic at the “opened apps”/“running processes” level, it’s not real at the heap allocation level within tasks within processes. And this is much more pronounced with code written in modern languages like Rust and some styles of C++. Modern heap allocators batch and cache (primarily to help with performance). But still, A LOT of memory is getting allocated and deallocated all the time, even from the kernel’s PoV.

    LRU itself is an imperfect approximation, not a goal. In the setup described in my other comment (fast SSD swap devices used only sparingly most of the time), so-called LRU inversion gets auto-cancelled relatively quickly: free space in RAM(+zram) becomes available all the time, some “LRU-hot” pages in SSD swap turn out to be actually cold, and those are the only ones that stay there.

    This is why I would imagine that a lot of fake scenarios, and “benchmarks” based on them, may fail to replicate the practical reality of many (overall system) use-cases.


    More tangentially, the oversized concern for file caching pages also suggests that specific use-cases were in mind, as if everyone is running DB-centric workloads or something.


  • This is not a good thing btw. Any unused anonymous page takes up space that could instead be used for file-backed pages that make your system faster.

    Can you expand here? I think my attempt at brevity in this part wasn’t helpful.

    Swap is not tiered storage!

    I meant tiered with priorities only, yes.

    Cool tech but it’s dead and was quite niche even when it was alive.

    We are not talking about the original purpose of Optane as supported on Windows. It’s just a (perhaps somewhat outdated) example of a storage device “smaller but faster than your average SSD storage”, which is very much not dead tech.

    Not a thing you actually want to use for swap

    Depends on the use-case. But yes, this can also be used as the fastest disk tier/priority of normal swap devices, which is why I mentioned both.

    This makes no sense at all unless you are extremely space-constrained on the NVMe and absolutely must not OOM – even if progress stalls to an absolute crawl.

    Why would you want to see killed processes when you go back to your workstation, in the 1/10000th scenario where something runs amok and pushes memory usage to unexpectedly high levels, when you can simply investigate the reason behind the rare occurrence, then move all the pages off the slowest devices immediately with swapoff?



  • Alright, I will only reply to you, since you raised a fair question.

    First of all, I must admit that I thought what was linked was an earlier similar writing, but the general theme is still the same.

    The problem with the writing is that it focuses on use-cases like Android and some servers, but doesn’t take into account other use-cases. It also seems to come with the assumption that setup is done by the distributor only, or if it’s done by the user, it’s a configure-and-forget situation.

    What he presents is:

    • Limited RAM space
    • Swap will always/often happen (outside of (z)ram)
    • Single tier of non-RAM swap
    • Non-RAM swap is significantly slower
    • OOM can be preferable over (outside of RAM) swapping
    • Swapped out pages stay where they are until they are required by their process (important).

    Now let’s look at a possible modern workstation setup:

    • Large RAM size
    • Swap is rarely hit, especially if set up with zram.
    • Multiple swap tiers beyond zram/zswap
      • Intel Optane disk used as a super-fast zram write-back device, or a high-priority swap
      • Fast NVMe disk used as a second tier swap disk
      • Large HDD swap partition used as a third tier swap disk
    • The biggest consideration is avoiding worst case latency, i.e. hitting HDD swap.
    • Killing processes MUST be avoided, unless exceptional circumstances are hit where the kernel’s OOM would kick in anyway. This holds true even when HDD swap starts getting used.
    • When unusual loads are observed, swapped pages can be moved around by the user (or a tool), by turning swap devices off and on. This is how you can empty the HDD swap partition for example.

    This last point in particular should make it clear why his “imagination” was rather limited in his LRU inversion section.
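
    To make that last bullet concrete, here is a rough Linux-only sketch of “turning a swap device off and on” via swapoff(2)/swapon(2). The function names swap_flags_for_priority and drain_swap_device are hypothetical, as is the choice to treat a negative priority as “let the kernel decide”. It needs root, and swapoff can fail if the pages don’t fit in RAM or the remaining swap.

```c
#include <stdio.h>
#include <sys/swap.h>   /* Linux-only: swapon(2), swapoff(2), SWAP_FLAG_* */

/* Build swapon(2) flags for an explicit priority. Devices with a higher
 * priority are used first. A negative value means "let the kernel pick"
 * (an assumption of this sketch), so no flags are set. */
int swap_flags_for_priority(int prio) {
    if (prio < 0) {
        return 0;
    }
    if (prio > SWAP_FLAG_PRIO_MASK) {
        prio = SWAP_FLAG_PRIO_MASK;   /* clamp to the largest valid priority */
    }
    return SWAP_FLAG_PREFER |
           ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}

/* Empty one swap device by cycling it: swapoff moves its pages back into
 * RAM or the other active swap devices, then swapon re-enables the (now
 * empty) device at the given priority. Returns 0 on success, -1 on error. */
int drain_swap_device(const char *path, int prio) {
    if (swapoff(path) != 0) {
        perror("swapoff");
        return -1;
    }
    if (swapon(path, swap_flags_for_priority(prio)) != 0) {
        perror("swapon");
        return -1;
    }
    return 0;
}
```

    In the three-tier setup above, the HDD partition would get the lowest priority, so calling drain_swap_device on it after an unusual load pushes its pages back up to the faster tiers.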