Spyke
asklemmy · Ask Lemmy · by Farmdude

Can we trust LLM calculations?

OK, say you have a moderately complex math problem you need to solve. You give the problem to 6 LLMs, all paid versions. All 6 return the same numbers. Would you trust the answer?


Short answer: no.

Long answer: They are still (mostly) statistics-based and can't do real math. You can use the answers from LLMs as a starting point, but you have to rigorously verify the answers they give.

70

The whole "two r's in strawberry" thing is enough of an argument for me. If things like that happen at such a low level, it's completely impossible that it won't make mistakes with problems that are exponentially more complicated than that.

29

The problem with that is that it isn't actually counting the R's.

You'd probably have better luck asking it to write a script for you that returns the number of instances of a letter in a string of text, then getting it to explain to you how to get it running and how it works. You'd get the answer that way, and also then have a script that could count almost any character and text of almost any size.

That's much more complicated, impressive, and useful, imo.
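A minimal sketch of the kind of script described above, in Python. The function name `count_char` and its `case_sensitive` flag are made up for illustration; this is just one simple way to do it.

```python
def count_char(text: str, char: str, case_sensitive: bool = False) -> int:
    """Count how many times a single character occurs in a piece of text."""
    if len(char) != 1:
        raise ValueError("char must be exactly one character")
    if not case_sensitive:
        # Compare in lowercase so that 'R' and 'r' are treated as the same letter.
        text, char = text.lower(), char.lower()
    return text.count(char)


print(count_char("strawberry", "r"))  # 3
```

The same function works on text of almost any size, for example `count_char(open("book.txt").read(), "e")`, where `book.txt` is a hypothetical file.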

8

A calculator as a tool for an LLM, though, that works, at least mostly, and could get better once the kinks are worked out.

2
suppo.fi

LLMs don't and can't do math. They don't calculate anything, that's just not how they work. Instead, they do this:

2 + 2 = ? What comes after that? Oh, I remember! It's '4'!

It could be right, it could be wrong. If there's enough pattern in the training data, it could remember the correct answer. Otherwise it'll just place a plausible looking value there (behavior known as AI hallucination). So, you can not "trust" it.

30

Some are just realistic to the point of being correct. It frightens me how many users have no idea about any of that.

8

A good one will interpret what you are asking and then write code, often Python I notice, and then let that do the math and return the answer. A math problem should use a math engine, and that's how it gets around the limitation.

But really, why bother? Go ask Wolfram Alpha or just write the math problem in code yourself.

2

They don’t calculate anything

They calculate the statistical probability of the next token in an array of previous tokens

1

Actually no, they have some sort of "circuits" that approximate math, which is even more interesting imo. Still not reliable in the slightest, of course.

1
sopuli.xyz

Why would I bother?

Calculators exist, logic exists, so no... LLMs are a laughably bad fit for directly doing math. They are bullshit engines: they cannot "store" a value without fundamentally exposing it to hallucinating tendencies, which is the worst property a calculator could possibly have.

21

Why would I bother?

Because you want to have a single interface that accepts natural-language input and gives answers.

That doesn't mean that using an LLM as a calculator is a reasonable approach --- though a larger system that incorporates an LLM might be. But I think that the goal is very understandable. I have Maxima, a symbolic math package, on my smartphone and computers. It's quite competent at probably just about any sort of mathematical problem that pretty much any typical person might want to do. It costs nothing. But...you do need to learn something about the package to be able to use it. You don't have to learn much of anything that a typical member of the public doesn't already know to use a prompt that accepts natural-language input. And that barrier is enough that most people won't use it.

1
Farmdude
lemmy.world

It was about all six models getting the same answer from different accounts. I was testing it. Over a hundred runs each, same numbers.

0

Right, so because LLMs are atrocious at actually precisely carrying out logic operations, the solution was likely to just throw a normal calculator inside the AI, make the AI use the calculator, and then turn around and handwave that the entire thing is AI.

So... you could just skip the bullshit and use a calculator, the AI just repackages the same answer with more boilerplate bullshit.

Wolfram Alpha is the non-bullshit version of this.

https://www.wolframalpha.com/

22

Would you trust six mathematicians who claimed to have solved a problem by intuition, but couldn’t prove it?

That’s not how mathematics works: if you have to “trust” the answer, it isn’t even math.

20

That wasn’t the question. The question was whether you should trust the number and the answer is no. It could be correct or it could be incorrect. There’s not enough data to determine it.

LLMs work as predictive models. If you ask 10 people to estimate the height of a tree, and 8/10 estimate that it’s 10 ft tall while 2/10 estimate that it’s 8 ft tall, the most likely LLM answer is that it’s 10 ft tall. It doesn’t matter that, if you actually go and measure the tree, it’s actually 15 ft tall. The LLM will likely report 10 ft.

21

Here's an interesting post that gives a pretty good quick summary of when an LLM may be a good tool.

Here's one key:

Machine learning is amazing if:

  • The problem is too hard to write a rule-based system for or the requirements change sufficiently quickly that it isn't worth writing such a thing and,
  • The value of a correct answer is much higher than the cost of an incorrect answer.

The second of these is really important.

So if your math problem is unsolvable by conventional tools, or sufficiently complex that designing an expression is more effort than the answer is worth... AND ALSO it's more valuable to have an answer than it is to have a correct answer (there is no real cost for being wrong), THEN go ahead and trust it.

If it is important that the answer is correct, or if another tool can be used, then you're better off without the LLM.

The bottom line is that the LLM is not making a calculation. It could end up with the right answer. Different models could end up with the same answer. It's very unclear how much underlying technology is shared between models anyway.

For example, if the problem is something like, "here is all of our sales data and market indicators for the past 5 years. Project how much of each product we should stock in the next quarter. " Sure, an LLM may be appropriately close to a professional analysis.

If the problem is like "given these bridge schematics, what grade steel do we need in the central pylon?" Then, well, you are probably going to be testifying in front of congress one day.

14

I wouldn't bother. If I really had to ask a bot, Wolfram Alpha is there as long as I can ask it without an AI meddling with my question.

E: To clarify, just because one AI or six will get the same answer that I can independently verify as correct for a simpler question, does not mean I can trust it for any arbitrary math question even if however many AIs arrive at the same answer. There's often the possibility the AI will stumble upon a logical flaw, exemplified by the "number of rs in strawberry" example.

12

I NEED TO consult every LLM VIA TELEKINESIS QUANTUM ELECTRIC GRAVITY A AND B WAVE.

2
lemmy.world

Using a calculator or Wolfram Alpha or similar tools, I don't trust the answer unless it passes a few sanity checks. Frequently I am the source of error, and no LLM can compensate for that.

7
Farmdude
lemmy.world

It checked out. But all six getting the same is likely incorrect?

-3
lemmy.zip

Yes. All six are likely to be incorrect.

Similarly, you could ask a subtle quantum mechanics question to six psychologists, and all six may well give you the same answer. You still should not trust that answer.

The way that LLMs correlate and gather answers is particularly unsuited to mathematics.

Edit: In contrast, the average psychologist is much more prepared to answer a quantum mechanics question than an average LLM is to answer a math or counting question.

6

Don't know. I've never asked any of them a maths question.

How costly is it to be wrong? You seem to care enough to ask people on the Internet so it suggests that it's fairly costly. I'd not trust them.

5
EpeeGnome
feddit.online

If all 6 got the same answer multiple times, then that means that your query very strongly correlated with that reply in the training data used by all of them. Does that mean it's therefore correct? Well, no. It could mean that there were a bunch of incorrect examples of your query they used to come up with that answer. It could mean that the examples it's working from seem to follow a pattern that your problem fits into, but the correct answer doesn't actually fit that seemingly obvious pattern. And yes, there's a decent chance it could actually be correct. The problem is that the only way to eliminate those other still also likely possibilities is to actually do the problem, at which point asking the LLM accomplished nothing.

3

I think the best thing at this juncture is to ask an LLM WHAT THE TRUTH IS LOL

2

No. Dear God, no. LLMs are not computers. They are just prediction machines. They predict that the next value is probably this value. There is no actual math there.

6
qaz
lemmy.world

Most LLMs now call functions in the background. Most calculations are just simple Python expressions.
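To make that concrete, here is a rough sketch of what a calculator "function" behind an LLM could look like. This is not any vendor's real API: the name `calculator_tool` is hypothetical, and it only assumes the model hands over an arithmetic expression as a string. The expression is parsed with Python's standard `ast` module and only plain arithmetic is allowed, so nothing else can be executed.

```python
import ast
import operator

# Arithmetic operators this sketch is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculator_tool(expression: str):
    """Evaluate a plain arithmetic expression; reject anything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(calculator_tool("(3 + 5) * 2 ** 3"))  # 64
```

The point of the design is that the arithmetic is done by ordinary code; the LLM's only job is to produce the expression, which is still a step that can go wrong.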

6

Yes. I was aware of that, but I was manipulated by an analog device

1
kbin.melroy.org

this is a really weird premise. doing the same thing on 6 models is just not worth it especially when wolfram alpha exists and is far more trustable and speedy

5
FaceDeer
fedia.io

If the LLMs are part of a modern framework I would expect that they should be calling out to Wolfram Alpha (or a similar specialized math-solver) via an API to get the answer for you, for that matter.

3
lemmy.world

Finally an intelligent comment. So many comments in here that don't realize most LLMs are bundled with calculators that just do the math.

2

Anti-AI sentiment is extremely strong in every part of the Fediverse I've seen so far, usually my comments get downvoted heavily even when I'm just describing factual details of how it works. I expect a lot of people simply don't bother after a while.

2
lemmy.world

No. Once I tried to do binary calc with ChatGPT and it kept giving me wrong answers. Good thing I had some unit tests around that part, so I realised quickly it's lying.

5

Yes more people need to realize it's just a search engine with natural language input and output. LLM output should at least include citations.

2

Just yesterday I was fiddling around with a logic test in Python. I wanted to see how well DeepSeek could analyze the intro line to a for loop. It properly identified what it did in the description, but when it moved on to giving examples it contradicted itself, and it took 3 or 4 replies before it realized that.

2
Farmdude
lemmy.world

But what if you gave the problem to all the top models and got the same answer? Is it still likely an incorrect answer? I checked 6. I checked a bunch of times, with different accounts. I was testing it, to see whether that's possible in others' opinions. I actually checked over a hundred times, and each got the same numbers.

-3

They could get the right answer 9999 times out of 10000 and that one wrong answer is enough to make all the correct answers suspect.

2
mander.xyz

What if there is a popular joke that relies on bad math that happens to be your question? Then the alignment is understandable and no indication of accuracy. Why use a tool with known issues, and overhead like querying six, instead of using a decent tool like Wolfram Alpha?

1

My use case was, I expect, easier and simpler, so I was able to write automated tests to validate the logic of incrementing specific parts of a binary number, and found that the expected test values the LLM produced were wrong.

So if it's possible to use some kind of automation to verify LLM results for your problem, you can be confident in your answer. But generally LLMs tend to make up shit and sound confident about it.
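A sketch of what that kind of check can look like. The details here are assumptions, not the original setup: `increment_field` is a hypothetical function that increments a bit field inside an integer and wraps on overflow, and the "claimed" values stand in for expected results an LLM might hand you. One of them is wrong on purpose.

```python
def increment_field(value: int, low: int, width: int) -> int:
    """Add 1 to the `width`-bit field starting at bit `low`, wrapping on overflow."""
    field_max = (1 << width) - 1
    mask = field_max << low
    field = (value & mask) >> low
    field = (field + 1) & field_max          # wrap instead of carrying out of the field
    return (value & ~mask) | (field << low)  # leave all other bits untouched


# (value, low, width, claimed_result) as supplied by some untrusted source.
claimed_cases = [
    (0b0000_0000, 0, 4, 0b0000_0001),
    (0b0000_1111, 0, 4, 0b0000_0000),  # field wraps around to zero
    (0b0011_0000, 4, 2, 0b0100_0000),  # wrong: carries into bit 6 instead of wrapping
]

# Keep every case where the claimed value disagrees with the computed one.
mismatches = [
    case for case in claimed_cases
    if increment_field(case[0], case[1], case[2]) != case[3]
]

for value, low, width, claimed in mismatches:
    actual = increment_field(value, low, width)
    print(f"claimed {claimed:#010b}, computed {actual:#010b}")
```

Of course this only helps if the checking code itself is trusted, which is why it is worth cross-checking it against plain integer arithmetic, e.g. `increment_field(v, 0, 8) == (v + 1) % 256`.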

1

Nope, language models, by their inherent nature, cannot be used to calculate. Sure, theoretically you could have input parsed, with proper training, to find specific variables, input those to a database and have that data mathematically transformed back into language data.

No LLMs do actual math; they only produce the most likely output to a given input based on trained data. If I input: What is 1 plus 1?

Then, given that the model has most likely been trained repeatedly on an answer to follow that being 1 + 1 = 2, that will be the output. If it was trained on data that was 1 + 1 = 5, then that would be the output.
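A toy illustration of that point, and only a toy: real LLMs are neural networks, not lookup tables, and the names `train` and `complete` are made up here. It just shows "output whatever most often followed this prompt in the training data", with no arithmetic anywhere.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count which answer follows each prompt in lines like '1 + 1 = 2'."""
    table = defaultdict(Counter)
    for line in corpus:
        prompt, _, answer = line.rpartition("=")
        table[prompt.strip()][answer.strip()] += 1
    return table


def complete(table, prompt):
    """Return the most frequently seen answer, or None for an unseen prompt."""
    if prompt not in table:
        return None
    return table[prompt].most_common(1)[0][0]


clean = train(["1 + 1 = 2"] * 3 + ["1 + 1 = 5"])
poisoned = train(["1 + 1 = 5"] * 3 + ["1 + 1 = 2"])

print(complete(clean, "1 + 1"))     # 2
print(complete(poisoned, "1 + 1"))  # 5
print(complete(clean, "7 + 8"))     # None
```

The same prompt gives "2" or "5" depending purely on what the data said more often.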

5
lemmy.world

I’ve used LLMs quite a few times to find partial derivatives / gradient functions for me, and I know it’s correct because I plug them into a gradient descent algorithm and it works. I would never trust anything an LLM gives blindly no matter how advanced it is, but in this particular case I could actually test the output since it's something I was implementing in an algorithm, so if it didn't work I would know immediately.
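For anyone wondering what testing the output can look like even before running gradient descent, a common trick is a finite-difference gradient check. The sketch below is an illustration, not the setup described above: the function `f` and the "claimed" gradient are invented examples, and the step size and tolerance are reasonable guesses rather than tuned values.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate each partial derivative of f at `point` by central differences."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def gradients_agree(f, claimed_grad, point, tol=1e-4):
    """True if the claimed analytic gradient matches the numerical one at `point`."""
    numeric = numerical_gradient(f, point)
    analytic = claimed_grad(point)
    return all(
        abs(a - n) <= tol * max(1.0, abs(a), abs(n))
        for a, n in zip(analytic, numeric)
    )


def f(p):
    x, y = p
    return x ** 2 * y + 3 * y


def claimed_gradient(p):
    # Pretend this formula came from an LLM: df/dx = 2xy, df/dy = x^2 + 3.
    x, y = p
    return [2 * x * y, x ** 2 + 3]


print(gradients_agree(f, claimed_gradient, [1.5, -2.0]))  # True
```

A check at one point is not a proof, so it is worth trying several points, but a wrong derivative usually fails immediately.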

5

That's rad, dude. I wish I knew how to do that. Hey, dude, I imagined a cosmological model that fits the data with two fewer parameters than the standard model. Planck data. I've checked the numbers, but I don't have the credentials. I need somebody to check it out. This is a it and a verbal explanation for the model by Academia.edu. It's way easier to listen first before looking. I don't want recognition or anything. Just for someone to review it. It's a short paper. https://youtu.be/_l8SHVeua1Y

2

Well, I wanted to know the answer and formula for future value of a present amount. The AI answer that came up was clear, concise, and thorough. I was impressed and put the formula into my spreadsheet. My answer did not match the AI answer. So I kept looking for what I did wrong. Finally I just put the value into a regular online calculator and it matched the answer my spreadsheet was returning.

So AI gave me the right equation and the wrong answer. But it did it in a very impressive way. This is why I think it's important for AI to only be used as a tool and not a replacement for knowledge. You have to be able to understand how to check the results.
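The comment doesn't say which formula it was, so as an assumption, here is the standard compound-interest form, FV = PV * (1 + r)^n, with interest compounded once per period. The numbers are made up. It is the kind of two-line check that catches a right formula paired with a wrong number.

```python
def future_value(present_value: float, rate: float, periods: int) -> float:
    """Future value of a present amount, compounded once per period."""
    return present_value * (1 + rate) ** periods


# Hypothetical example: 1000 at 5% per year for 10 years.
print(round(future_value(1000, 0.05, 10), 2))  # 1628.89
```

In a spreadsheet the same calculation would be something like `=1000*(1+0.05)^10`.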

4

Never a base model, absolutely with an agent and function calling with a properly made tool and retrieval.

3

No, LLMs are designed to drive up user engagement, nothing else. They're programmed to present what you want to hear, not actual facts. Plus, they're straight up not designed to do math.

3

No lol. I don't trust a calculator to write me text, and I don't trust an autocomplete to solve my math problems.

3

You cannot trust LLMs. Period.

They are literally hallucination machines that just happen to be correct sometimes.

3

Maybe? I'd be looking all over for some convergent way to fuck it up, though.

If it's just one model or the answers are only close, lol no.

3

For practice yeah as there is usually something you can do to verify the value. For study no as you would not learn shit.

2

That's why you ask 6 of them, and if they all come to the same conclusion then chances are it's either right, or a common pitfall.

2
fedia.io

How trustable the answer is depends on knowing where the answers come from, which is unknowable. If the probability of the answers being generated from the original problem is high because it occurred in many different places in the training data, then maybe it's correct. Or maybe everyone who came up with the answer is wrong in the same way and that's why there is so much correlation. Or perhaps the probability match is simply because lots of math problems tend towards similar answers.

The core issue is that the LLM is not thinking or reasoning about the problem itself, so trusting it with anything is more assuming the likelihood of it being right more than wrong is high. In some areas this is safe to do, in others it's a terrible assumption to make.

2
Farmdude
lemmy.world

I'm a little confused after listening to a podcast with.... Damn I can't remember his name. He's English. They call him the godfather of AI. A pioneer.

Well, he believes that GPT 2-4 were major breakthroughs in artificial intelligence. He specifically said ChatGPT is intelligent, that some type of reasoning is taking place, and that the end of humanity could be anywhere from a year to 50 years away. This is the fella who imagined a neural net modeled on the human brain, and he says it is doing much more. Who should I listen to? He didn't say hidden AI. HE SAID CHATGPT. HONESTLY, NO OFFENSE. I JUST DON'T UNDERSTAND THIS EPIC SCENARIO ON ONE SIDE AND TOTALLY NOTHING ON THE OTHER.

-2

Anyone with a stake in the development of AI is lying to you about how good models are and how soon they will be able to do X.

They have to be lying because the truth is that LLMs are terrible. They can't reason at all. When they perform well on benchmarks, it's because every benchmark contains questions that are in the LLM's training data. If you burn trillions of dollars and have nothing to show, you lie so people keep giving you money.

https://arxiv.org/html/2502.14318

However, the extent of this progress is frequently exaggerated based on appeals to rapid increases in performance on various benchmarks. I have argued that these benchmarks are of limited value for measuring LLM progress because of problems of models being over-fit to the benchmarks, lack real-world relevance of test items, and inadequate validation for whether the benchmarks predict general cognitive performance. Conversely, evidence from adversarial tasks and interpretability research indicates that LLMs consistently fail to learn the underlying structure of the tasks they are trained on, instead relying on complex statistical associations and heuristics which enable good performance on test benchmarks but generalise poorly to many real-world tasks.

5
Rhaedas
fedia.io

One step might be to try to understand the basic principles behind what makes an LLM function. The YouTube channel 3blue1brown has at least one good video on transformers and how they work, and perhaps that will help you understand that "reasoning" is a very broad term that doesn't necessarily mean thinking. What is going on inside an LLM is fascinating, and it's amazing what does manage to come out that's useful, but like any tool it can't be used for everything well, if at all.

2

Funny, but also not a bad idea, as you can ask it to clarify on things as you go. I just reference that YT channel because he has a great ability to visually show things to help them make sense.

1

If you really want to see how good they are, have them do the full calculation for 30! (30 factorial) And see how close it gets to real numbers.
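For reference, getting the exact value to compare against is a one-liner, since Python integers have arbitrary precision. A minimal sketch (the helper name `claim_is_exact` is made up):

```python
import math

exact = math.factorial(30)
print(exact)  # 265252859812191058636308480000000


def claim_is_exact(claimed: int, n: int = 30) -> bool:
    """Compare a claimed value for n! against the exact integer result."""
    return claimed == math.factorial(n)


# A value that is off by one in the last digit is simply wrong, however close it looks.
print(claim_is_exact(265252859812191058636308480000001))  # False
```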

1

@[email protected]

Short: no, I'd check it myself.

While LLMs must work with numbers under the hood, these algorithms don't deal with numbers directly, so they're unable to do any sort of direct calculation.

What they could do is indirect calculation through memorization: something similar to how we promptly answer "2+2" or "3*5" without needing to go through all the effort of calculating it, just by memory. For LLMs, it's "pre-fill" phase followed by "decode" phase, basically filling what comes after the pre-fill ("What is 2+2?" => "The answer is 4"). This is highly dependent on the training dataset, so if training dataset is poisoned or inaccurate, the output will be equally poisoned or inaccurate (e.g. they were trained only on joking corpora where "2+2=5" is repeated, they're very likely to output "The result of 2+2 is 5").

Now, where LLMs could be a bit more reliable (but not that reliable) is finding links between two or more words (again, highly reliant on their training data), as in the famous example "King - man + woman = ?": their output will correctly bring the token "Queen". It's called "Word embedding".
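To show the arithmetic behind that example, here is a toy sketch. The 2-D vectors are invented by hand (one axis for "royalty", one for "masculinity") and come from no real model; real embeddings have hundreds or thousands of learned dimensions. The function name `analogy` is also just illustrative.

```python
import math

# Made-up 2-D vectors: (royalty, masculinity). Not taken from any real model.
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.1),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word whose vector is closest to a - b + c, excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman"))  # queen
```

Note that the calculation is done on the vectors by ordinary code; the result only comes back out as a word by looking up the nearest neighbour.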

Similar words (i.e. words with similar meaning) will be closer inside semantic space, such as "red", "rojo", "vermelho" and "rouge", so this means that LLMs have a "proto-languageless" conceptualization of words (to a certain extent, it's an approximated model of how words across multiple languages are wired within the brain of a polyglot human), yet they're unable to access it (just like the said polyglot can't verbally express the "transcendental concept" behind "red" and "rojo" and "vermelho" without using something within a language.... it's Saussurean Structuralism in practice, where one can't express a signified without a signifier: trust me, I tried...).

In a nutshell, humans managed to build calculators that can't calculate, except for (maybe) the proximity between two or more sequences of letters ("strings" or "tokens"), but they're unable to express this proximity in its raw form (numbers), they must use another sequence of letters to express it (the signifier).

Allow me to provoke back: can we trust human calculations? Are our brains that much different from a large language model, especially in situations where we've got no tool (calculators and/or paper) to mechanically look up the answer, solely relying on our inner monologue and/or our proprioception? I can make the scenario worse: with no proprioception allowed (so no finger-counting, maybe the person is physically tied and blindfolded), could one be trusted when they spell out the result of a calculation conveyed just by using their inner monologue and/or visual mind? In this scenario, how's that different from the inference of an LLM?

How do you define 'real'? If you're talking about what you can feel, what you can smell, what you can taste and see, then 'real' is simply electrical signals interpreted by your brain. (Morpheus, The Matrix)

1

Probably, depending on the context. It is possible that all 6 models were trained on the same misleading data, but not very likely in general.

Number crunching isn't an obvious LLM use case, though. Depending on the task, having it create code to crunch the numbers, or a step-by-step tutorial on how to derive the formula, would be my preference.

-1