The article discusses the mysterious nature of large language models and their remarkable capabilities, focusing on the challenges of understanding why they work. Researchers at OpenAI stumbled upon unexpected behavior while training language models, highlighting phenomena such as “grokking” and “double descent” that defy conventional statistical explanations. Despite rapid advancements, deep learning remains largely trial-and-error, lacking a comprehensive theoretical framework. The article emphasizes the importance of unraveling the mysteries behind these models, not only for improving AI technology but also for managing potential risks associated with their future development. Ultimately, understanding deep learning is portrayed as both a scientific puzzle and a critical endeavor for the advancement and safe implementation of artificial intelligence.

  • kromem@lemmy.world
    4 months ago

    It’s really so much worse than this article even suggests.

    For example, one thing it doesn’t really touch on is the unexpected result emerging over the last year that a trillion-parameter network may develop capabilities which can then be passed on to a network less than a hundredth its size, by generating synthetic data from the larger model and feeding it to the smaller one. (I doubt even a double-digit percentage of researchers would have expected that result before it showed up.)
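    The data flow described above (query the larger model, collect its outputs as synthetic data, train the smaller model on that data) can be sketched with a toy example. This is only an illustration under loud assumptions: `teacher` is a hypothetical stand-in function rather than a real trillion-parameter network, the "student" is a two-parameter line fitted by least squares, and none of the names come from the article.

```python
import random

def teacher(x: float) -> float:
    """Hypothetical stand-in for a large model: a black box we can only query."""
    return 3.0 * x + 1.0

def make_synthetic_data(n: int, seed: int = 0) -> list:
    """Sample inputs and label them with the teacher's own outputs."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, teacher(x)) for x in xs]

def fit_student(data: list) -> tuple:
    """Fit the small model y = a*x + b by closed-form least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var = sum((x - mean_x) ** 2 for x, _ in data)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# The student never sees the teacher's internals, only its outputs.
a, b = fit_student(make_synthetic_data(50))
```

    The toy only shows the mechanics; it says nothing about why the transfer works as well as it reportedly does at LLM scale.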

    Even weirder was a result showing that if you use chain-of-thought (CoT) prompting to improve a model’s answers and then feed only the questions and final answers into a new model, without the ‘chain’ from the CoT, the second network still ends up trained on the content of the chain.

    The degree to which very subtle details in the training data are ending up modeled seems to go beyond even some of the wilder expectations of researchers right now. Just this past week I saw a subtle psychological phenomenon I used to present about appear very clearly, and very much by the book, in GPT-4 outputs given the correct social context. I didn’t expect that to be the case for at least another generation or two of models and hadn’t expected the current SotA models to replicate it at all.

    Two weeks ago, for the first time, I saw an LLM code-switch to a different language when that language had a more fitting term for the concept being discussed. There’s no way the statistically most likely way to continue a discussion of motivations in English was to drop into a language barely represented in English-speaking countries. This was with the new Gemini, which also seems to have internalized a bias towards symbolic representations in its generation, to the point that they appear to be filtering out emojis (in the past I’ve found examples where switching from nouns to emojis improves the critical reasoning abilities of models, as it breaks token similarity patterns in favor of more abstracted capabilities).

    Adding the transformer’s self-attention to diffusion models has suddenly resulted in Sora’s video generation plausibly simulating things like fluid dynamics and physics.

    We’re only just starting to unravel some of the nuances of self-attention, such as recognizing the attention sinks in the first tokens and the importance of preserving them across larger sliding context windows.
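    The idea of preserving the first tokens while the context window slides can be sketched as a cache-retention rule. This is a minimal sketch, assuming a policy that always keeps the first `n_sink` positions plus the most recent `window` positions; the function name, parameters, and default of 4 sink tokens are illustrative assumptions, not details taken from the comment or the article.

```python
def kept_positions(seq_len: int, window: int, n_sink: int = 4) -> list:
    """Return the token positions retained by a sliding window with sinks.

    Assumed policy: keep the first `n_sink` tokens (the attention sinks)
    and the last `window` tokens; drop everything in between.
    """
    if seq_len <= n_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent

# With 10 tokens, a window of 3, and 2 sinks, the middle tokens are dropped.
positions = kept_positions(10, window=3, n_sink=2)
```

    A plain sliding window corresponds to `n_sink=0`, which evicts the first tokens as soon as the sequence outgrows the window.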

    For the last year at least, especially after GPT-4 leapfrogged expectations, it has very much felt as the article states: this field is eerily like early 20th-century physics, where experimental results were regularly turning a half century of accepted theories on their head and fringe theories that had generally been dismissed were suddenly being validated by multiple replicated results.

  • Redacted@lemmy.world
    4 months ago

    This article, along with others covering the topic, seems to foster an air of mystery about machine learning which I find quite off-putting.

    “Known as generalization, this is one of the most fundamental ideas in machine learning—and its greatest puzzle. Models learn to do a task—spot faces, translate sentences, avoid pedestrians—by training with a specific set of examples. Yet they can generalize, learning to do that task with examples they have not seen before.”

    Sounds a lot like Category Theory to me, which is all about abstracting rules as far as possible to form associations between concepts. This would explain other phenomena discussed in the article.

    “Like, why can they learn language? I think this is very mysterious.”

    Potentially because language structures can be encoded as categories. Any possible concept including the whole of mathematics can be encoded as relationships between objects in Category Theory. For more info see this excellent video.

    “He thinks there could be a hidden mathematical pattern in language that large language models somehow come to exploit: ‘Pure speculation but why not?’”

    Sound familiar?

    “models could seemingly fail to learn a task and then all of a sudden just get it, as if a lightbulb had switched on.”

    Maybe there is a threshold probability of a posited association being correct, and after enough iterations the model crosses it and flips the association to “true”.

    I’d prefer articles to discuss the underlying workings, even if speculative like the above, rather than perpetuating the “It’s magic, no one knows.” narrative. Too many people (especially here on Lemmy it has to be said) pick that up and run with it rather than thinking critically about the topic and formulating their own hypotheses.

    • orclev@lemmy.world
      4 months ago

      Yeah, pretty much this. My understanding of the way LLMs function is that they operate on statistical associations of words, which would amount to categories in Category Theory. Basically, the training phase classifies words into categories based on the examples in the training input. Then, when you feed it a prompt, it just uses those categories to parse and “solve” your prompt. It’s not “mysterious”; it’s just opaque because it’s an incredibly complicated model. Exactly the sort of thing that people are really bad at working with, but which computers are really good with.
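      The phrase “statistical associations of words” can be made concrete with a deliberately tiny toy: a bigram table that counts which word follows which in the training text and then predicts the most frequent follower. This is an illustrative assumption only; real LLMs are neural networks over tokens and do not store a count table like this, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional

def train_bigrams(text: str) -> dict:
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# "the" is followed by "cat" twice and "mat" once, so "cat" is predicted.
counts = train_bigrams("the cat sat on the mat the cat ran")
guess = predict_next(counts, "the")
```

      Even this toy hints at the opacity point: every count is inspectable, yet a table built from a large corpus is already too big for a person to reason about by hand.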