My main account is dark_arc@social.packetloss.gg. However, as of roughly 24 hours ago (it seems this has actually been going on since March 10th and has gotten worse since), the server has stopped properly retrieving content from lemmy.world.

It’s been running smoothly for well over 9 months, and (I think) it’s still working fine for content coming in from other instances. So I’m curious: has anyone else experienced anything strange with lemmy.world federation recently?

Setup Description

The server flow in my case is as follows:

[Public Internet] <-> [Digital Ocean Droplet] <-> [ZeroTier] <-> [Physical Machine in my Basement (HW Info)]

The Digital Ocean droplet is a virtual host that forwards requests via nginx to the physical machine, where a second nginx server (running the standard Lemmy nginx config) then forwards the request to the Lemmy server software itself.

Current Status

Lemmy Internal Error

I’ve found this in my Lemmy logs:

2024-03-24T00:42:10.062274Z  WARN lemmy_utils: error in spawn: Unknown: Request limit was reached during fetch
   0: lemmy_apub::objects::community::from_json
             at crates/apub/src/objects/community.rs:126
   1: lemmy_apub::fetcher::user_or_community::from_json
             at crates/apub/src/fetcher/user_or_community.rs:87
   2: lemmy_server::root_span_builder::HTTP request
           with http.method=POST http.scheme="http" http.host=social.packetloss.gg http.target=/inbox otel.kind="server" request_id=688ad030-f892-4925-9ce9-fc4f3070a967
             at src/root_span_builder.rs:16

I’m thinking this could be the cause… though I’m not sure how to raise the limit (it seems to be hard-coded). I opened an issue with the Lemmy devs, but I’ve since closed it while I gather more information and make sure this is truly an issue with the Lemmy server software.

Nginx 408 and 499s

I’m seeing the Digital Ocean nginx server reporting 499s on various “/inbox” route requests, and the nginx running on the physical machine that talks directly to Lemmy reporting 408s on the same kinds of “/inbox” requests.
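To get a rough picture of how often this happens, something like the following scan of an access log works (the log path is whatever your nginx writes to, and the regex assumes the standard combined log format):

#!/usr/bin/env python3
"""Rough per-hour tally of 408/499 responses on /inbox in an nginx access log."""
import re
import sys
from collections import Counter

# Pulls the timestamp, request line, and status code out of a combined-format entry.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) ')

def scan(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m or "/inbox" not in m.group("req"):
                continue
            if m.group("status") in ("408", "499"):
                hour = m.group("ts")[:14]  # e.g. "22/Mar/2024:12"
                counts[(hour, m.group("status"))] += 1
    return counts

if __name__ == "__main__":
    # e.g. python3 inbox_timeouts.py /var/log/nginx/access.log
    for (hour, status), n in sorted(scan(sys.argv[1]).items()):
        print(f"{hour}  {status}  {n:>4}")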

There are some examples in this comment: https://lemmy.world/comment/8728858

  • seang96 · 9 months ago

    If you’re failing on lemmy.world like others suggest, but federation is working with other instances, it’s likely you’re getting large queries from lemmy.world that are timing out. You can check for timeouts in your proxy’s logs.

    Troubleshooting tips

    This guide is for tuning postgres, if you find that’s the issue. It has good tips, including how to analyze problems on the postgres side, though it’s oriented toward Matrix tuning. work_mem and temp files will likely need constant tuning to meet your demands if you have limited memory for postgres.
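    A quick way to see whether queries are actually spilling past work_mem is to check the temp-file counters postgres keeps per database; a minimal check, with placeholder connection settings, could look like this:

    #!/usr/bin/env python3
    """Show work_mem and whether queries are spilling to temp files (a sign work_mem is too small)."""
    import psycopg2

    # Placeholder connection settings; point these at your Lemmy database.
    conn = psycopg2.connect(host="localhost", dbname="lemmy", user="lemmy", password="changeme")
    with conn, conn.cursor() as cur:
        cur.execute("SHOW work_mem")
        print("work_mem:", cur.fetchone()[0])
        # temp_files / temp_bytes grow whenever a sort or hash exceeds work_mem and spills to disk.
        cur.execute("""
            SELECT datname, temp_files, pg_size_pretty(temp_bytes)
            FROM pg_stat_database
            WHERE datname = current_database()
        """)
        for datname, temp_files, temp_size in cur.fetchall():
            print(f"{datname}: {temp_files} temp files, {temp_size} of temp data written")
    conn.close()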

    Lastly, I wanted to note that Lemmy does application-level connection pooling, so if you have a pooler service in front of postgres, don’t; let Lemmy connect directly to postgres.

    Hope these help. I recently went from a federation backlog of about 1 million activities from lemmy.world to near none. I believe I’m at a RAM limit, and a couple of expensive queries (deleting posts, for example) are creating temp files and taking too long.

    • Dark Arc@lemmy.world (OP) · 9 months ago

      So, of the responses so far, I think you’re the most on the right track…

      It seems to be exclusively incoming content from lemmy.world. If you look here, my most recent comment is on lemmy.world:

      https://social.packetloss.gg/comment/1415801 https://lemmy.world/comment/8710941

      The instance just isn’t getting any new posts, comments, or votes back from lemmy.world.

      Every time I shut down the Lemmy server I see this:

      2024-03-23T17:34:33.774333Z  WARN lemmy_server: Received ctrl-c, shutting down gracefully...
      2024-03-23T17:34:33.774912Z  WARN lemmy_federate: Waiting for 1618 workers (30.00s max)
      

      That number never seems to move; there are always 1618 workers. I’m not sure whether that means anything regarding pending processing or what have you.

      I am seeing in my publicly facing nginx logs:

      135.181.143.221 - - [22/Mar/2024:12:23:28 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:23:40 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:23:54 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:24:12 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:24:38 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:25:21 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:26:35 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:28:53 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:33:19 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:42:01 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:12:59:15 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:13:33:33 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      135.181.143.221 - - [22/Mar/2024:21:31:55 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      

      There’s then an internal nginx server that sees:

      10.241.127.2 - - [22/Mar/2024:12:23:18 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:24:19 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:24:31 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:24:45 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:25:03 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:25:29 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:26:11 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:27:25 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:29:43 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:34:09 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:12:42:51 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:13:00:06 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:13:34:24 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:14:42:49 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:16:59:32 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [22/Mar/2024:21:32:45 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      10.241.127.2 - - [23/Mar/2024:06:39:03 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
      

      So, things did start timing out. I’m not sure what to do about that though.

      This server is not resource starved:

      load average: 0.04, 0.09, 0.10
      
                    total        used        free      shared  buff/cache   available
      Mem:          31507        7651        1092         164       22764       23239
      Swap:         16087           1       16086
      

      It’s just this lemmy.world data that’s suddenly out of whack after months of normal operation (on both Lemmy 0.18 and 0.19).

      It feels like a bad payload that the server just can’t move past for some reason and that lemmy.world keeps resending.

      I had logging on the Lemmy container piped to /dev/null because it’s just such a noisy log. I’ve turned it back on… I’ll see if I can find more information next time lemmy.world sends something.

      • seang96 · 9 months ago

        Another note: you want a federation round trip in under 3 seconds. Those 499s are likely 10-second timeouts from lemmy.world’s side, mostly with a long query being the culprit. Lemmy federation in 0.19 processes changes in order per instance, so lemmy.world won’t sync newer stuff and will keep sending the same large request. I believe there is a retry limit, though.
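        If I remember right, 0.19 also exposes the sending side’s per-instance queue state at /api/v3/federated_instances, so you can ask lemmy.world directly whether its queue toward your instance is still advancing. A rough sketch (the response shape is from memory, so treat the field names as assumptions):

        #!/usr/bin/env python3
        """Peek at lemmy.world's outbound federation state toward one instance (0.19 API, from memory)."""
        import json
        import urllib.request

        URL = "https://lemmy.world/api/v3/federated_instances"
        MY_DOMAIN = "social.packetloss.gg"

        with urllib.request.urlopen(URL, timeout=30) as resp:
            data = json.load(resp)

        # "linked" should list every instance lemmy.world federates with; in 0.19 each entry
        # is expected to carry a federation_state block describing the send queue toward it.
        for inst in data["federated_instances"]["linked"]:
            if inst.get("domain") == MY_DOMAIN:
                print(json.dumps(inst.get("federation_state"), indent=2))
                break
        else:
            print(f"{MY_DOMAIN} not found in lemmy.world's linked instances")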

        • Dark Arc@lemmy.world (OP) · 9 months ago

          Yeah, I mean, things should be fine in general; like I said, this has been working for quite a long time now without issue.

          The machine that’s actually doing the work here is quite powerful and runs several game servers in addition to Lemmy… Lemmy really isn’t much more than a footnote in resource usage:

          CPU:
            Info: 8-core model: Intel Core i7-10700 bits: 64 type: MT MCP cache: L2: 2 MiB
            Speed (MHz): avg: 4653 min/max: 800/4800 cores: 1: 4698 2: 4685 3: 4786 4: 4704 5: 4694
              6: 4700 7: 4800 8: 4801 9: 4802 10: 3408 11: 4756 12: 4713 13: 4706 14: 4707 15: 4798 16: 4703
          Drives:
            Local Storage: total: 931.51 GiB used: 380.39 GiB (40.8%)
            ID-1: /dev/nvme0n1 vendor: Western Digital model: WDS100T2B0C-00PXH0 size: 931.51 GiB
          Partition:
            ID-1: / size: 914.18 GiB used: 380.02 GiB (41.6%) fs: xfs dev: /dev/dm-0
            ID-2: /boot size: 1014 MiB used: 370 MiB (36.5%) fs: xfs dev: /dev/nvme0n1p2
            ID-3: /boot/efi size: 598.8 MiB used: 5.8 MiB (1.0%) fs: vfat dev: /dev/nvme0n1p1
          Swap:
            ID-1: swap-1 type: partition size: 15.71 GiB used: 1.2 MiB (0.0%) dev: /dev/dm-1
          Sensors:
            System Temperatures: cpu: 28.0 C pch: 26.0 C mobo: N/A
            Fan Speeds (rpm): N/A
          Info:
            Processes: 358 Uptime: 16h 39m Memory: total: 32 GiB note: est. available: 30.77 GiB
            used: 8.54 GiB (27.8%) Init: systemd target: multi-user (3) Shell: fish inxi: 3.3.30
          
          • seang96 · 9 months ago

            How much RAM is postgres using? I’m running on an 11th-gen i7 with 16 GB of RAM, so your hardware is fine. This is the resource usage of my setup. The first six are Lemmy itself, broken up so it can scale and for redundancy. I have a node down right now; normally I run three of the HTTP ones. The HTTP ones receive federation from other instances, and the lemmy-instance2 pods are the databases, also set up to be highly available.

            NAME                          CPU(cores)   MEMORY(bytes)
            lemmy-6b9fbd75f6-4q4n8        1m           5Mi
            lemmy-fed-0                   2m           18Mi
            lemmy-fed-1                   1m           18Mi
            lemmy-fed-2                   6m           17Mi
            lemmy-http-57886f784f-5x9vb   4m           73Mi
            lemmy-http-57886f784f-tpbtb   13m          121Mi
            lemmy-instance2-czc6-0        47m          3362Mi
            lemmy-instance2-jpw7-0        13m          1017Mi

            • Dark Arc@lemmy.world (OP) · 9 months ago

              Not sure what you’re using to generate that list; the formatting is a bit difficult to read.

              I don’t have a cluster, since it’s effectively single user + @Auto_Post_Bot@social.packetloss.gg (in theory a few other people have access, but they’re not active) on a single machine. It’s just more or less the out-of-the-box Docker stuff on a bare-metal machine in my basement + a Digital Ocean droplet.

              The droplet is what I’m using to get a static IP and avoid dynamic DNS nonsense, plus it provides some level of protection against a naive DDoS attack on random fediverse servers (since, in the worst case, I can get on my phone and sever the ZeroTier connection that connects the droplet to my basement server).

              I’m pretty confident whatever is going on is payload related at this point.

                  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
                50622 70        20   0  330264 240200 201512 S   0.0   0.7   0:25.21 postgres
                50636 70        20   0  327804 239520 201296 S   0.0   0.7   0:26.55 postgres
                50627 70        20   0  327204 239152 201592 S   0.0   0.7   0:24.75 postgres
                50454 70        20   0  328932 238720 200872 S   0.0   0.7   0:26.61 postgres
                50639 70        20   0  313528 217800 193792 S   0.0   0.7   0:03.13 postgres
                50641 70        20   0  313284 217336 194204 S   0.0   0.7   0:03.15 postgres
                50626 70        20   0  313592 216604 193636 S   0.0   0.7   0:05.07 postgres
                50632 70        20   0  313236 216460 193968 S   0.0   0.7   0:04.52 postgres
                50638 70        20   0  310368 216084 193856 S   0.0   0.7   0:04.20 postgres
                50614 70        20   0  310520 216072 193840 S   0.0   0.7   0:02.88 postgres
                50642 70        20   0  312200 215920 194068 S   0.0   0.7   0:04.46 postgres
                50640 70        20   0  312584 215724 193676 S   0.0   0.7   0:03.32 postgres
                50635 70        20   0  309744 215404 193764 S   0.0   0.7   0:02.72 postgres
                50630 70        20   0  312168 215224 193488 S   0.0   0.7   0:02.67 postgres
                50621 70        20   0  309560 215096 193772 S   0.0   0.7   0:02.97 postgres
                50646 70        20   0  309492 215008 193560 S   0.0   0.7   0:04.66 postgres
                50625 70        20   0  309760 215004 193368 S   0.0   0.7   0:03.08 postgres
                50637 70        20   0  309296 214992 193848 S   0.0   0.7   0:02.87 postgres
                50616 70        20   0  310596 214984 192700 S   0.0   0.7   0:04.17 postgres
                50643 70        20   0  310392 214940 194008 S   0.0   0.7   0:04.14 postgres
                50624 70        20   0  310128 214880 192928 S   0.0   0.7   0:04.15 postgres
                50631 70        20   0  310220 214596 192576 S   0.0   0.7   0:02.71 postgres
                50613 70        20   0  309364 213880 192520 S   0.0   0.7   0:04.06 postgres
                50628 70        20   0  309852 213236 191504 S   0.0   0.7   0:03.04 postgres
                50634 70        20   0  187772 163388 149428 S   0.0   0.5   0:02.87 postgres
                50644 70        20   0  189684 162892 148508 S   0.0   0.5   0:04.11 postgres
                50633 70        20   0  186096 162544 149324 S   0.0   0.5   0:03.20 postgres
                50629 70        20   0  185644 162112 149296 S   0.0   0.5   0:04.62 postgres
                50618 70        20   0  186264 160576 147928 S   0.0   0.5   0:04.10 postgres
                50582 70        20   0  185708 160236 147592 S   0.0   0.5   0:04.10 postgres
                 3108 70        20   0  172072 144092 142256 S   0.0   0.4   0:04.46 postgres
                 3109 70        20   0  172024 142404 140632 S   0.0   0.4   0:02.24 postgres
                 2408 70        20   0  171856  23660  22020 S   0.0   0.1   0:00.76 postgres
                 3113 70        20   0  173536   9472   7436 S   0.0   0.0   0:00.15 postgres
                 3112 70        20   0  171936   8732   7020 S   0.0   0.0   0:01.54 postgres
                 3114 70        20   0  173472   5624   3684 S   0.0   0.0   0:00.00 postgres
              

              I’ve got quite a bit of experience with postgres; I don’t see any indication it’s the problem.

              • seang96 · 9 months ago

                Formatting the output is a bit bad with markdown in Lemmy, haha; I wasn’t really able to see how much you’re using in postgres. My list was from Kubernetes output.

                Anyway, knowing you have two proxies in front of your service, I’d suggest reducing it to one. It may only shave a small amount off the round trip, but anything helps, and it simplifies your stack. You can also split Lemmy up into multiple microservices; I see benefits to this in multiple ways, since each gets its own dedicated resources, the logs are separate, and each is under less load individually.

                Providing your postgres configuration could be helpful too, maybe in a pastebin? It’s a little difficult to find where the bottleneck is, since everyone has a different setup. Lemmy also has a Matrix admin support group that can help troubleshoot.

      • seang96 · 9 months ago

        You have it proxied twice? That’s something I had at one point too, but I resolved it, and that reduced timeouts as well. It adds another hop before reaching your Lemmy instance and makes things more complicated, so I’d recommend reducing that. Are you doing this with k8s, where one proxy is the ingress and the other is there so you can apply the proper configs?