A guide to local coding models

(aiforswes.com)

518 points | by mpweiher 19 hours ago

51 comments

simonw 18 hours ago
> I realized I looked at this more from the angle of a hobbiest paying for these coding tools. Someone doing little side projects—not someone in a production setting. I did this because I see a lot of people signing up for $100/mo or $200/mo coding subscriptions for personal projects when they likely don’t need to.
Are people really doing that?
If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex is charged a whole lot lower than Claude.
The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
[-]
- kristopolous 14 hours ago
  I use local models + openrouter free ones.
  My monthly spend on ai models is < $1
  I'm not cheap, just ahead of the curve. With the collapse in inference cost, everything will be this eventually
  I'll basically do
```
    $ man tool | <how do I do this with the tool>
```
  or even
```
    $ cat source | <find the flags and give me some documentation on how to use this>
```
  Things I used to do intensively I now do lazily.
  I've even made a IEITYuan/Yuan-embedding-2.0-en database of my manpages with chroma and then I can just ask my local documentation how I do something conceptually, get the man pages, inject them into local qwen context window using my mansnip llm preprocessor, forward the prompt and then get usable real results.
  In practice it's this:
```
    $ what-man "some obscure question about nfs" 
    ...chug chug chug (about 5 seconds)...

    <answer with citations back to the doc pages>
```
  Essentially I'm not asking the models to think, just do NLP and process text. They can do that really reliably.
  It helps combat a frequent tendency for documentation authors to bury the most common and useful flags deep in the documentation and lead with those that were most challenging or interesting to program instead.
  I understand the inclination it's just not all that helpful for me
  [-]
  - nl 12 hours ago
    This is a completely different thing to AI coding models.
    If you aren't using coding models you aren't ahead of the curve.
    There are free coding models. I use them heavily. They are ok but only partial substitutes for frontier models.
    [-]
    - kristopolous 2 hours ago
      I'm extremely familiar with them.
      Some people, with some tasks, get great results
      But me, with my tasks, I need to maintain provenance and accountability over the code. I can't just have AI fly by the seat of its pants.
      I can get into lots of detail on this. If you have seen tools and setups I have done you'd realize why it doesn't work for me.
      I've spent money, the results for me, with my tasks, have not been the right decision.
  - alfonsodev 3 hours ago
    I use llm from command line too, time to time, is just easier to do
    llm 'output a .gitignore file for typical python project that I can pipe into the actual file ' > .gitignore
  - m4ck_ 13 hours ago
    Is your RAG manpages thing on github somewhere? I was thinking about doing something like that (it's high on my to-do list but I haven't actually done anything with llms yet.)
    [-]
    - kristopolous 10 hours ago
      I'll get it up soon, probably should. This little snippet will help you though:
      $ man --html="$(which markitdown)" <man page>
      That goes man -> html -> markdown which is not only token efficient but also llms are pretty good at creating hierarchies from markdown
      [-]
      - r-w 9 hours ago
        I bet you could do the same thing with pandoc and skip serializing to HTML entirely.
        [-]
        mkesper 8 hours ago
        Apparently yes: https://pandoc.org/MANUAL.html#options
    - scottyeager 7 hours ago
      Not the OP, but I did release my source :D https://github.com/scottyeager/Pal
      My tool can read stdin, send it to an LLM, and do a couple nice things with the reply. Not exactly RAG, but most man pages fit into the context window so it's okay.
  - aquafox 9 hours ago
    > I'll basically do
```
    $ man tool | <how do I do this with the tool>
```
    or even $ cat source | <find the flags and give me some documentation on how to use this>
    Could you please elaborate on this? Do I get this right that you can set up your your command line so that you can pipe something to a command that sends this something together with a question to an LLM? Or did you just mean that metaphorically? Sorry if this is a stupid question.
    [-]
    - mr_mitm 5 hours ago
      Yes, I use simonw's `llm` for that: https://github.com/simonw/llm
      Example:
      $ man tar | llm "how do I extract test.txt from a tar.gz"
    - scottyeager 8 hours ago
      I'm not the OP, but I did build a tool that I use in the same way: https://github.com/scottyeager/Pal
      Actually for many cases the LLM already knows enough. For more obscure cases, piping in a --help output is also sometimes enough.
    - __m 8 hours ago
      i guess op means: $ man tool | ai <how do I do this with the tool>
      where ai could be a simple shell script combining the argument with stdin
  - fragmede 1 hour ago
    > My monthly spend on ai models is < $1
    > I'm not cheap
    You're cheap. It's okay. We're all developers here. It's a safe space.
    [-]
    - mathgeek 1 hour ago
      While I say this somewhat in jest, frugal is just cheap but with better value.
- Aurornis 13 hours ago
  The limits for the $20/month plan can be reached in 10-20 minutes when having it explore large codebases with directed. It’s also easy to blow right through the quota if you’re not managing content well (waiting until it fills up and then auto-compacting, or even using /compact frequently instead of /clear or the equivalent in different tools).
  For most of my work I only need the LLM to perform a structured search of the codebase or to refactor something faster than I can type, so the $20/month plan is fine for me.
  But for someone trying to get the LLM to write code for them, I could see the $20/month plans being exhausted very quickly. My experience with trying “vibecoding” style app development, even with highly detailed design documents and even providing test case expected output, has felt like lighting tokens on fire at a phenomenal rate. If I don’t interrupt every couple of commands and point out some mistake or wrong direction it can spin seemingly for hours trying to deal with one little problem after another. This is less obvious when doing something basic like a simple React app, but becomes extremely obvious once you deviate from material that’s represented a lot in training materials.
  [-]
  - sheepscreek 12 hours ago
    Not for Codex. Not even for Gemini/Antigravity! I am truly shocked by how much mileage I can get out of them. I recently bought the $200/mo OpenAI subscription but could barely use 10% of it. Now for over a month, I use codex for at least 2 hrs every day and have yet to reach the quota.
    With Gemini/Antigravity, there’s the added benefit of switching to Claude Code Opus 4.5 once you hit your Gemini quota, and Google is waaaay more generous than Claude. I can use Opus alone for the entire coding session. It is bonkers.
    So having subscribed to all three at their lowest subscriptions (for $60/mo) I get the best of each one and never run out of quota. I’ve also got a couple of open-source model subscriptions but I’ve barely had the chance to use them since Codex and Gemini got so good (and generous).
    The fact that OpenAI is only spending 30% of their revenue on servers and inference despite being so generous is just mind boggling to me. I think the good times are likely going to last.
    My advise - get Gemini + Codex lowest tier subscriptions. Add some credits to your codex subscription in case you hit the quota and can’t wait. You’ll never be spending over $100 even if you’re building complex apps like me.
    [-]
    - Aurornis 11 hours ago
      > I recently bought the $200/mo OpenAI subscription but could barely use 10% of it
      This entire comment is confusing. Why are you buying the $200/month plan if you’re only using 10% of it?
      I rotate providers. My comment above applies to all of them. It really depends on the work you’re doing and the codebase. There are tasks where I can get decent results and barely make the usage bar move. There are other tasks where I’ve seen the usage bar jump over 20% for the session before I get any usable responses back. It really depends.
      [-]
      - sheepscreek 8 hours ago
        I got it to try Atlas, their agentic browser, before it was open to Plus users. I convinced myself that I could use the additional capacity to multi-task and push through hard core problems without worrying about quota limits.
        For context, this was a few months ago when GPT 5 was new and I was used to constantly hitting o3 limits. It was an experiment to see if the higher plan could pay for itself. It most certainly can but I realized that I just don’t need it. My workflow has evolved into switching between different agents on the same project. So now I have much less of a need for any one.
        [-]
        wahnfrieden 7 hours ago
        To use up the Pro tier plan you must close the loop so to speak - so that Codex knows how to test the quality of its output and incrementally inch toward its goals. This can be harder or easier depending on your project.
        You should also queue up many "continue ur work" type messages.
        [-]
        sheepscreek 2 hours ago
        I’m actively doing that for a fun side project - systematically rewriting SQLite in Rust. The goal is to preserve 100% compatibility, quirks and all. First I got it to run the native test harness, and now it’s basically doing TDD by itself. Have to say, with regular check-ins, it works quite well.
        Note: I’m using the $20 plan for this! With codex-5.2-medium most of the time (previously codex-5.1-max-medium). For my work projects, Gemini 3 and Antigravity Claude Opus 4.5 are doing the heavy lifting at the moment, which frees up codex :) I usually have it running constantly in a second tab.
        The only way I can now justify Pro is if I am developing multiple parallel projects with codex alone. But that isn’t the case for me. I am happier having a mix of agents to work with.
      - selcuka 11 hours ago
        Not the same poster, but apparently they tried the $200/mo subscription, but after seeing they don't need it, they "subscribed to all three at their lowest subscriptions (for $60/mo)" instead.
        [-]
        Aurornis 10 hours ago
        > but apparently they tried the $200/mo subscription, but after seeing they don't need it
        This is why it’s confusing, though. Why start with the highest plan as the starting point when it’s so easy to upgrade?
        [-]
        1over137 9 hours ago
        Because you’re rich?
        [-]
        sheepscreek 8 hours ago
        Not rich. I pay in Canadian dollars :(
        I’m just a simple dude trying to optimize his life.
    - nl 11 hours ago
      I do the same and agree this works well.
      It's worth noting that the Claude subscription seems notably less than the others.
      Also there are good free options for code review.
    - jjromeo 5 hours ago
      Can confirm this is the way right now
  - JamesSwift 1 hour ago
    That has not been my experience with sonnet, and even so it is largely remedied by having better AI docs caching the results of that investigation for future use.
  - stuaxo 5 hours ago
    You'd think local models could explore a codename and build up a knowledge graph of it they could use to query it.
    It could take longer, but save your subscription tokens.
- uneekname 16 hours ago
  Yes, we are doing that. These tools help make my personal projects come to life, and the money is well worth it. I can hit Claude Code limits within an hour, and there's no way I'm giving OpenAI my money.
  [-]
  - _delirium 15 hours ago
    As a third option, I've found I can do a few hours a day on the $20/mo Google plan. I don't think Gemini is quite as good as Claude for my uses, but it's good enough and you get a lot of tokens for your $20. Make sure to enable the Gemini 3 preview in gemini-cli though (not enabled by default).
    [-]
    - deaux 13 hours ago
      Huge caveat: For the $20/mo subscription Google hasn't made clear if they train on your data. Anthropic and OAI on the other hand either clearly state they don't train on paid usage or offer very straightforward opt-outs.
      https://geminicli.com/docs/faq/
      > What is the privacy policy for using Gemini Code Assist or Gemini CLI if I’ve subscribed to Google AI Pro or Ultra?
      > To learn more about your privacy policy and terms of service governed by your subscription, visit Gemini Code Assist: Terms of Service and Privacy Policies.
      > https://developers.google.com/gemini-code-assist/resources/p...
      The last page only links to generic Google policies. If they didn't train on it, they could've easily said so, which they've done in other cases - e.g. for Google Studio and CLI they clearly say "If you use a billed API key we don't train, else we train". Yet for the Pro and Ultra subscriptions they don't say anything.
      This also tracks with the fact that they enormously cripple the Gemini app if you turn off "apps activity" even for paying users.
      If any Googlers read this, and you don't train on paying Pro/Ultra, you need to state this clearly somewhere as you've done with other products. Until then the assumption should be that you do train on it.
      [-]
      - versteegen 5 hours ago
        I have no idea at all whether the GCP "Service Specific Terms" [1] apply to Gemini CLI, but they do apply to Gemini used via Github Copilot [2] (the $10/mo plan is good value for money and definitely doesn't use your data for training), and states:
        Service Terms 17. Training Restriction. Google will not use Customer Data to train or fine-tune any AI/ML models without Customer's prior permission or instruction.
        [1] https://cloud.google.com/terms/service-terms
        [2] https://docs.github.com/en/copilot/reference/ai-models/model...
        [-]
        ayewo 4 hours ago
        Thanks for those links. GitHub Copilot looks like a good deal at $10/mo for a range of models.
        I originally thought they only supported the previous generation models i.e. Claude Opus 4.1 and Gemini 2.5 Pro based on the copy on their pricing page [1] but clicking through [2] shows that they support far more models.
        [1] https://github.com/features/copilot#pricing
        [2] https://github.com/features/copilot/plans#compare
      - w23j 5 hours ago
        That's the main reason, why I hope Google does not win this AI war.
      - lostmsu 1 hour ago
        Are you sure about OpenAI? I thought they actually do retain your agent chats (training I am less concerned about personally).
        Anthropic has an option to opt out of training and delete the chats from their cloud in 30 days.
      - _delirium 12 hours ago
        That's good to know, thanks. In my case nearly 100% of my code ends up public on GitHub, so I assume everyone's code models are training on it anyway. But would be worth considering if I had proprietary codebases.
- wyre 17 hours ago
  Me. Currently using Claude Max for personal coding projects. I've been on Claude's $20 plan and would run out of tokens. I don't want to give my money to OpenAI. So far these projects have not returned their value back to me, but I am viewing it as an investment in learning best pratices with these coding tools.
  [-]
  - ssss11 6 hours ago
    Me too. I couldn’t build an app that I hope to publish with the $20 plan. The sunk cost will either be reaped back once live, or it’s truly sunk and I’ll move on…..
- satvikpendem 17 hours ago
  > If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic.
  > The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
  These are the same people, by and large. What I have seen is users who purely vibe code everything and run into the limits of the $20/m models and pay up for the more expensive ones. Essentially they're trading learning coding (and time, in some cases, it's not always faster to vibe code than do it yourself) for money.
  [-]
  - maddmann 17 hours ago
    If this is the new way code is written then they are arguably learning how to code. Jury is still out though, but I think you are being a bit dismissive.
    [-]
    - satvikpendem 14 hours ago
      I wouldn't change definitions like that just because the technology changed, I'm talking about the ability to analyze control flow and logic, not necessarily put code on the screen. What I've seen from most vibe coders is that they don't fully understand what's going on. And I include myself, I tried it for a few months and the code was such garbage after a while that I scrapped it and redid it myself.
    - dns_snek 6 hours ago
      Absolutely not. They're not writing code or performing most of the work that programmers do, therefore they're not [working as] programmers. Their work ends up producing code, but they're not coders any more than my manager is.
      A "vibecoder" is to a programmer what script kiddie is to a hacker.
  - cmrdporcupine 15 hours ago
    I've been a software developer for 25 years, and 30ish years in the industry, and have been programming my whole life. I worked at Google for 10 of those years. I work in C++ and Rust. I know how to write code.
    I don't pay $100 to "vibe code" and "learn to program" or "avoid learning to program."
    I pay $100 so I can get my personal (open source) projects done faster and more completely without having to hire people with money I don't have.
    [-]
    - codetiger 12 hours ago
      Came here to write something similar (Of course, other than working in Google) and saw your comments reflecting my views. Yes, Its worth pending $200/month on Claude to get my personal project ideas come to life with better quality and finish.
    - satvikpendem 14 hours ago
      I'm talking about the general trend, not the exceptions. How much of the code do you manually write with the 100 dollar subscription? Vibe coding is a descriptive, not a prescriptive, label.
      [-]
      - cmrdporcupine 13 hours ago
        "How much of the code do you manually write"
        I review all of it, but hand write little of it. It's bizarre how I've ended up here, but yep.
        That said, I wouldn't / don't trust it with something from scratch, I only trust it to do that because I built -- by hand -- a decent foundation for it to start from.
        [-]
        satvikpendem 10 hours ago
        Sure, you're like me, you're not a vibe coder by the actual definition then. Still, the general trend I see is that a lot of actual vibe coders do try to get their product working, code quality be damned. Personally, same as you, I stopped vibe coding and actually started writing a lot of architecture and code myself first then allowing the LLM to fill in the features so to speak.
        [-]
        kasey_junk 2 hours ago
        The issue is that your claim was that if you are using up tokens you are probably vibe coding.
        But I’ve not found that to be true at all. My actually engineered processes where I care the most is where I push tokens the hardest. Mostly because I’m using llms in many places in the sdlc.
        When I’m vibing it’s just a single agent sort of puttering along. It uses much fewer tokens.
    - beepbooptheory 14 hours ago
      Why would you ever hire someone to help with a personal open source project?
      [-]
      - wredcoll 12 hours ago
        Depends on if the goal is to solve a problem (by writing code) or the goal is to write code (maybe solving a problem)
      - cmrdporcupine 14 hours ago
        I wouldn't, but I can pay Claude
      - fragmede 5 hours ago
        because we want to support open source? Even if you're independence maximalist, you still pay other people in your life to do things for you at some point. If you've got the money and the desire but not the time, why does that not seem reasonable to you?
        [-]
        cmrdporcupine 2 minutes ago
        Frankly I almost consider it a duty to use these agents -- which have harvested en masse from open source software (including GPL!) without permission -- to produce open source / free software.
        Restoring a bit of balance to things.
- ncruces 15 hours ago
  What I find perplexing is the very respectful people that pay those subscriptions to produce clearly sub-par work I'm sure they wouldn't have done themselves.
  And when pressed on “this doesn't make sense, are you sure this works?” they ask the model to answer, it gets it wrong, and they leave it at that.
- mudkipdev 15 hours ago
  Claude's $20 plan should be renamed to "trial". Try Opus and you will reach your limit in 10 minutes. With Sonnet, if you aren't clearing the context very often, you'll hit it within a few hours. I'm sympathetic to developers who are using this as their only AI subscription because while I was working on a challenging bug yesterday I reached the limit before it had even diagnosed the problem and had to switch to another coding agent to take over. I understand you can't expect much from a $20 subscription, but the next jump up costing $80 is demotivating.
  [-]
  - kxrm 15 hours ago
    > Try Opus and you will reach your limit in 10 minutes.
    That hasn't been true with Opus 4.5. I usually hit my limit after an hour of intense sessions.
    [-]
    - deaux 13 hours ago
      Daily limit? Weekly limit? Hitting a weekly limit after an hour still doesn't seem very productive.
      [-]
      - throwthrowuknow 13 hours ago
        Session limit that resets after 5 hours timed from the first message you sent. Most people I’ve seen report between 1 to 2 hours of dev time using Opus 4.5 on the Pro plan before hitting it unless you’re feeding in huge files and doing a bad job of managing your context.
        [-]
        deaux 13 hours ago
        Okay, that sounds pretty reasonable for a $20 subscription.
        [-]
        throwthrowuknow 2 hours ago
        Yeah it’s really not too bad but it does get frustrating when you hit the session limit in the middle of something. I also add $20 of extra usage so I can finish up the work in progress cleanly and have Opus create some notes so we can resume when the session renews. Gotta be careful with extra usage though because you can easily use it up if the context is getting full so it’s best to try to work in small independent chunks and clear the context after each. It’s more work but helps both with usage and Opus performs better when you aren’t pushing the context window to the max.
  - throwthrowuknow 13 hours ago
    I half agree, but it should be called “Hobbiest” since that’s what it’s good for. 10 minutes is hyperbolic, I average 1h30m even when using plan mode first and front loading the context with dev diaries, git history, milestone documents and important excerpts from previous conversations. Something tells me your modules might be too big and need refactoring. That said, it’s a pain having to wait hours between sessions and jump when the window opens to make sure I stay on schedule and can get three in a day but that works ok for hobby projects since I can do other things in between. I would agree that if you’re using it for work you absolutely need Max so that should be what’s called the Pro plan but what can you do? They chose the names so now we just need to add disclaimers.
    [-]
    - lodovic 9 hours ago
      I actually get more mileage out of Claude using a Github Copilot subscription. The regular Claude Pro will give me an hour or up to 90 minutes max, before it reaches the cap. The Github version has a monthly limit for the Claude requests (100 "premium requests") which I find much easier to manage. I was about to switch to the max plan but this setup (both Claude pro and Github Copilot, costing 30 a month together) was just enough for my needs. With a bonus that I can try some of the other model offerings as well.
      [-]
      - throwthrowuknow 2 hours ago
        Good to hear that’s working. When I was using copilot before Opus 4.5 came out I found it didn’t perform as well as Claude Code but maybe it works better now with 4.5 and the latest improvements to VSCode. I’ll have to try it again.
      - ayewo 4 hours ago
        In practice, how does switching between Claude and GitHub Copilot work?
        1. Do you start off using the Claude Code CLI, then when you hit limits, you switch to the GitHub Copilot CLI to finish whatever it is you are working on?
        2. Or, you spend most of your time inside VSCode so the model switching happens inside an IDE?
        3. Or, you are more of a strict browser-only user, like antirez :)?
    - cdelsolar 1 hour ago
      The word is hobbyist btw, not that you're the source for this typo, it seems to have percolated downwards from the blog post through these comments.
  - socrateslee 3 hours ago
    Gemini 3 on Gemini CLI (free version) would meet quota limit for about 3-4 messages, but it will take much longer time since it responses pretty slow.
  - bdangubic 15 hours ago
    the only thing that matters is whether or not you are getting your money’s worth. nothing else matters. if claude is worth $100 or $200 per month to you, it is an easy decision to pay. otherwise stick with $20 or nothing
  - lelele 15 hours ago
    > With Sonnet, if you aren't clearing the context very often, you'll hit it within a few hours.
    Do you mean that users should start a new chat for every new task, to save tokens? Thanks.
    [-]
    - jfreds 15 hours ago
      Short answer is yes. Not only is it more token-friendly and potentially lower latency, it also prevents weird context issues like forgetting Rules, compacting your conversation and missing relevant details, etc.
      [-]
      - bitexploder 13 hours ago
        Yep. I have Claude snapshot to a markdown doc with key points and resume and iterate. Saves so many tokens.
    - stuaxo 5 hours ago
      Yes, it also helps keep it focused.
  - bubbi 3 hours ago
    [dead]
- joshribakoff 13 hours ago
  To me, it doesn’t matter how cheap open AI codex is because that tool just burns up tokens, trying to switch to the wrong version of node using NVM on my machine. It spirals in a loop and never makes progress, for me, no matter how explicitly or verbosely i prompt.
  On the other hand, Claude has been nothing but productive for me.
  I’m also confused why you don’t assume people have the intelligence to only upgrade when needed. Isn’t that what we’re all doing? Why would you assume people would immediately sign up for the most expensive plan that they don’t need? I already assumed everyone starts on the lowest plan and quickly runs into session limits and then upgrades.
  Also coaching people on which paid plan to sign up for kinda has nothing to do with running a local model, which is what this article is about
  [-]
  - nineteen999 13 hours ago
    I spent about 45 mins trying to get both Claude and ChatGPT to help get Codex running on my machine (WSL2) and on a Linux NUC, they couldn't help me get it working so I gave up and went back to Claude.
  - c-hendricks 13 hours ago
    Why is an LLM trying to switch node versions?
    [-]
    - wredcoll 12 hours ago
      Because somewhere inside its little non-deterministic brain, the phrase "switch to node version xxx" was the most probable response to the previous context.
- bonsai_spool 15 hours ago
  I also pay for the $100 plan as a researcher in biology dealing with a fair amount of data analysis in addition to bench work.
  Incidentally, wondering if anyone has seen this approach of asking Claude to manage Codex:
  https://www.reddit.com/r/codex/comments/1pbqt0v/using_codex_...
- __mharrison__ 17 hours ago
  I'm convinced the $20 gpt plus plan is the best plan right now. You can use Codex with gpt5.2. I've been very impressed with this.
  (I also have the same MBP the author has and have used Aider with Qwen locally.)
  [-]
  - andix 16 hours ago
    From my personal experience it's around 50:50 between Claude and Codex. Some people strongly prefer one over the other. I couldn't figure out yet why.
    I just can't accept how slow codex is, and that you can't really use it interactively because of that. I prefer to just watch Claude code work and stop it once I don't like the direction it's taking.
    [-]
    - asabla 16 hours ago
      From my point of view, you're either choosing between instruction following or more creative solutions.
      Codex models tend to be extremely good at following instructions, to the point that it won't do any additional work unless you ask it to. GPT-5.1 and GPT-5.2 on the other hand is a little bit more creative.
      Models from Anthropics on the other hand is a lot more loosy goosy on the instructions, and you need to keep an eye on it much more often.
      I'm using models interchangeably from both providers all the time depending on the task at hand. No real preference if one is better then the other, they're just specialized on different things
  - baq 17 hours ago
    bit the bullet this week and paid for a month of claude and a month of chatgpt plus. claude seems to have much lower token limits, both aggregate and rate-limited and GPT-5.2 isn't a bad model at all. $20 for claude is not enough even for a hobby project (after one day!), openai looks like it might be.
    [-]
    - InsideOutSanta 17 hours ago
      I feel like a lot of the criticism the GPT-5.x models receive only applies to specific use cases. I prefer these models over Anthropic's because they are less creative and less likely to take freedoms interpreting my prompts.
      Sonnet 4.5 is great for vibe coding. You can give it a relatively vague prompt and it will take the initiative to interpret it in a reasonable way. This is good for non-programmers who just want to give the model a vague idea and end up with a working, sensible product.
      But I usually do not want that, I do not want the model to take liberties and be creative. I want the model to do precisely what I tell it and nothing more. In my experience, te GPT-5.x models are a better fit for that way of working.
      [-]
      - deaux 13 hours ago
        A lot of the criticism from GPT-5.x models stems from the fact they're dog slow so you end up paying with your own time.
- didip 14 hours ago
  When you look at how capable Claude is, vs the salary of even a fresh graduate, combined with how expensive your time is… Even the maximum plan is a super good deal.
- Aeolun 4 hours ago
  > Are people really doing that?
  Sure am. Capacity to finish personal projects has tripled for a mere $200/month. Would purchase again.
- hamdingers 17 hours ago
  And as a hobbyist the time to sign up for the $20/month plan is after you've spent $20 on tokens at least a couple times.
  YMMV based on the kinds of side projects you do, but it's definitely been cheaper for me in the long run to pay by token, and the flexibility it offers is great.
  [-]
  - iOSThrowAway 17 hours ago
    I spent $240 in one week through the API and realized the $20/month was a no-brainer.
- minimaxir 16 hours ago
  Claude 4.5 Opus on Claude Code's $20 plan is funny because you get about 2-3 prompts on any nontrivial task before you hit the session limit.
  If I wasn't only using it for side projects I'd have to cough up the $200 out of necessity.
  [-]
  - port3000 7 hours ago
    Just get the $100 plan? (5X). I code most of the day and hit the 5-hour limit a couple of times a week, and never hit the weekly limit.
- smcleod 17 hours ago
  On a $20/mo plan doing any sort of agentic coding you'll hit the 5hr window limits in less than 20 minutes.
  [-]
  - simonw 16 hours ago
    With Codex it only happened to me once in my 4.5hr session here: https://simonwillison.net/2025/Dec/15/porting-justhtml/
    Claude Code is a whole lot less generous though.
    [-]
    - stuaxo 5 hours ago
      This is useful info.
      I havent tried agentic coding as I havent set it up in a container yet, and not going to yolo my system (doing stuff via chat and a utility to copy and paste directories and files got me pretty far over the last year and a half).
    - alostpuppy 13 hours ago
      For sure. On one project I kept using codex just to see where the wall was. Took a long time.
      [-]
      - deaux 13 hours ago
        It helps that Codex is so much slower than Anthropic models, a 4.5 hours Codex session might as well be a 2 hour Claude Code one. I use both extensively FWIW.
  - andix 17 hours ago
    It really depends. When building a lot of new features it happens quite fast. With some attention to context length I was often able to go for over an hour on the 20$ claude plan.
    If you're doing mostly smaller changes, you can go all day with the 20$ Claude plan without hitting the limits. Especially if you need to thoroughly review the AI changes for correctness, instead of relying on automated tests.
    [-]
    - allenu 16 hours ago
      I find that I use it on isolated changes where Claude doesn’t really need to access a ton of files to figure out what to do and I can easily use it without hitting limits. The only time I hit the 4-5 hour limit is when I’m going nuts on a prototype idea and vibe coding absolutely everything, and usually when I hit the limit, I’m pretty mentally spent anyway so I use it as a sign to go do something else. I suppose everyone has different styles and different codebases, but for me I can pretty easily stay under the limit without that it’s hard to justify $100 or $200 a month.
- asciii 12 hours ago
  > The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
  leo dicaprio snapping gif
  These kinds of articles should focus on use case because mileage may vary depending on maturity of idea, testing and host of other factors.
  If the app, service, or whatever is unproven, that's a sunk cost on macbook vs. 4 weeks to validate an idea which is a pretty long time.
  If the idea is sound then run it on macbook :)
- RickyLahey 4 hours ago
  depending on your usecase $200/mo is often not much for a coding tool if you're using it for commercial purposes
  in my experience cursor is nicer to work with the openai/anthropic cli tools
- SkyPuncher 14 hours ago
  Time is my limiting factor, especially on personal projects. To me, this makes any multiplying effect valuable.
  When I consider it against my other hobbies, $100 is pretty reasonable for a month of supply. That being said, I wouldn’t do it every month. Just the months I need it.
- haritha-j 16 hours ago
  I’ve been using vs code copilot pro for a few months and never really had any issue, once you hit the limit for one model, you generally still have a bunch more models to choose from. Unless I was vibe coding massive amounts of code without looking to testing, it’s hard to imagine I will run out of all the available pro models.
  [-]
  - deaux 13 hours ago
    Copilot Pro works with a total requests budget rather than per-model limits unless something changed. Could you explain?
    [-]
    - haritha-j 4 hours ago
      Oh wow, you're absolutely correct. In my head i recall this being different, I think i've confused myself about either when I was trialling antigravity, or the system they had earlier in this year where you would get notifications that you've used up a given model, at least for a limited time. I feel like the latter was a thing, but you've now made me question my memory, so wouldn't swear by it.
- shepherdjerred 15 hours ago
  I pay $200/mo just for Claude Code. I used Cursor for a while and used something like $600 in credits in Nov.
- strangescript 13 hours ago
  this, provided you don't mind hopping around a lot, 5 20 dollar a month accounts will get you way more tokens typically, also good free models will show up from time to time on openrouter
- wahnfrieden 7 hours ago
  I regularly hit my limits on the $200/mo Codex plan (using medium reasoning). (I am using everything for production - these aren't toy ideas.)
- A4ET8a8uTh0_v2 4 hours ago
  Anecdata, buddy is paying claude for his personal stuff. But he is more brave about testing things in production as it were:D
- bottlepalm 14 hours ago
  When you pay $1000/month for health insurance and $2000/month for housing.. $200 for something you actually enjoy isn't so bad.
  [-]
  - tempsaasexample 13 hours ago
    Would you be homeless for 3 days a month so that you could have 30 days of AI?
    Not a serious question but I thought it's an interesting way of looking at value.
    I used to sell cars in SF. Some people wouldn't negotiate over $50 on a $500 a month lease because their apartment was $4k anyway.
    Other people WOULD negotiate over $50 because their apartment was $4k.
- cmrdporcupine 15 hours ago
  Codex $20 is a good deal but they have nothing inbetween $20 and $200.
  The $20 Anthropic plan is only enough to wet my appetite, I can't finish anything.
  I pay for $100 Anthropic plan, and keep a $20 Codex plan in my back pocket for getting it to do additional review and analysis overtop of what Opus cooks up.
  And I have a few small $ of misc credits in DeepSeek and Kimi K2 AI services mainly to try them out, and for tasks that aren't as complicated, and for writing my own agent tools.
  $20 Claude doesn't go very far.
  [-]
  - KronisLV 1 hour ago
    Idk why the gap is so big, surely a bunch of people would also pay 50$ a month across multiple vendors for medium amount of tokens.
    [-]
    - cmrdporcupine 1 hour ago
      Indeed I would consider switching to Codex completely if a) they had a $100 or $50 membership b) they really worked on improving the CLI tool a lot more. It's about 4-6 months behind Claude Code
- jwpapi 17 hours ago
  Not everybody is broke.
- CSMastermind 8 hours ago
  If you're a hobbyist doing a side project, I'd start with Google and use anti-gravity, then only move to OpenAI when the project gets too complex for Gemini to handle things.
bilater 23 minutes ago
If you are using local models for coding you are midwiting this. Your code should be worth more than a subscription.
The only legit use case for local models is privacy.
I don't know why anyone would want to code with an intern level model when they can get a senior engineer level model for a couple of bucks more.
It DOESN'T MATTER if you're writing a simple hello world function or building out a complex feature. Just use the f*ing best model.
[-]
- pcl 7 minutes ago
  [delayed]
- jgalt212 15 minutes ago
  I will use a local coding model for our proprietary / trade secrets internal code when Google uses Claude for its internal code and Microsoft starts using Gemini for internal code.
  The flip side of this coin is I'd be very excited if Jane Street or DE Shaw were running their trading models through Claude. Then I'd have access to billions of dollars of secrets.
jwr 7 hours ago
I am still hoping, but for the moment… I have been trying every 30-80B model that came out in the last several months, with crush and opencode, and it's just useless. They do produce some output, but it's nowhere near the level that claude code gets me out of the box. It's not even the same league.
With LLMs, I feel like price isn't the main factor: my time is valuable, and a tool that doesn't improve the way I work is just a toy.
That said, I do have hope, as the small models are getting better.
[-]
- DrAwdeOccarim 3 hours ago
  I use Opus 4.5 and GPT 5.2-Codex through VS Code all day long, and the closest I've come is Devstral-Small-2-24B-Instruct-2512 inferring on a DGX Spark hosting with vLLM as an "Open AI Compatible" API endpoint I use to power the Cline VS Code extension.
  It works, but it's slow. Much more like set it up and come back in an hour and it's done. I am incredibly impressed by it. There are quantized GGUFs and MLXs of the 123B, which can fit on my M3 36GB Macbook that I haven't tried yet.
  But overall, it feels about about 50% too slow, which blows my mind because we are probably 9 months away from a local model that is fast and good enough for my script kiddie work.
- larodi 6 hours ago
  Claude Code is a lot about prompting and orchestration of the conversation. The LLM is just a tool in these agentic frameworks. Whats truly ingenious is how context is engineered/managed, how is the code-RAG approached, and them LLM memory that is used.
  So my guess would be - we need open conversation or something along the line of "useful linguistic-AI approaches for combing and grooming code"
  [-]
  - jwr 5 hours ago
    Agreed. I've been trying to use opencode and crush, and none of them do anything useful for me. In contrast, claude code "just works" and does genuinely useful work. And it's not just because of the specific LLM used, it's the overall engineering of the tool, the prompt behind the scenes, etc.
    But the bottom line is that I still can't find a way to use either local LLMs and/or opencode and crush for coding.
Workaccount2 17 hours ago
I'm curious what the mental calculus was that a $5k laptop would competitively benchmark against SOTA models for the next 5 years was.
Somewhat comically, the author seems to have made it about 2 days. Out of 1,825. I think the real story is the folly of fixating your eyes on shiny new hardware and searching for justifications. I'm too ashamed to admit how many times I've done that dance...
Local models are purely for fun, hobby, and extreme privacy paranoia. If you really want privacy beyond a ToS guarantee, just lease a server (I know they can still be spying on that, but it's a threshold.)
[-]
- ekjhgkejhgk 17 hours ago
  I agree with everything you said, and yet I cannot help but respect a person who wants to do it himself. It reminds me of the hacker culture of the 80s and 90s.
  [-]
  - slicktux 16 hours ago
    Agreed, Everyone seems to shun the DIY hacker now a days; saying things like “I’ll just pay for it”. It’s not about just NOT paying for it but doing it yourself and learning how to do it so that you can pass the knowledge on and someone else can do it.
    [-]
    - davidw 16 hours ago
      I loathe the idea of being beholden to large corporations for what may be a key part of this job in the future.
      [-]
      - Eupolemos 6 hours ago
        And we all know that enshittyfication is coming.
        [-]
        ekjhgkejhgk 6 hours ago
        Exactly. Google doesn't show you what it knows is the most appropriate answer, it shows you a compromise between the most appropriate answer and the one that makes them the most money.
        Same thing will happen with these tools, just a matter of time.
- smcleod 17 hours ago
  My 2023 Macbook Pro (M2 Max) is coming up to 3 years old and I can run models locally that are arguably "better" than what was considered SOTA about 1.5 years ago. This is of course not an exact comparison but it's close enough to give some perspective.
  [-]
  - menaerus 2 hours ago
    OpenAI released GPT-4o in May 2024, and Anthropic released Claude 3.5 Sonnet in June 2024.
    I haven't tried the local models as much but I'd find it difficult to believe that they would outperform the 2024 models from OpenAI or Anthropic.
    The only major algorithmic shift was done towards the RLVR and I believe it was already being applied during the 2023-2024.
- wyldfire 13 hours ago
  Is that really the case? This summer there was "Frontier AI performance becomes accessible on consumer hardware within a year" [1] which makes me think it's a mistake to discount the open weights models.
  [1] https://epoch.ai/data-insights/consumer-gpu-model-gap
  [-]
  - hu3 12 hours ago
    Open weight models are neat.
    But for SOTA performance you need specialized hardware. Even for Open Weight models.
    40k in consumer hardware is never going to compete with 40k of AI specialized GPUs/servers.
    Your link starts with:
    > "Using a single top-of-the-line gaming GPU like NVIDIA’s RTX 5090 (under $2500), anyone can locally run models matching the absolute frontier of LLM performance from just 6 to 12 months ago."
    I highly doubt a RTX 5090 can run anything that competes with Sonnet 3.5 which was released June, 2024.
    [-]
    - Lapel2742 5 hours ago
      > I highly doubt a RTX 5090 can run anything that competes with Sonnet 3.5 which was released June, 2024.
      I don't know about the capabilities of a 5090 but you probably can run a Devstral-2 [1] model locally on a Mac with good performance. Even the small Devstral-2 model (24b) seems to easily beat Sonnet 3.5 [2]. My impression is that local models have made huge progress.
      Coding aside I'm also impressed by the Ministral models (3b, 8b and 14b) Mistral AI released a a couple of weeks ago. The Granite 4.0 models by IBM also seem capable in this context.
      [1] https://mistral.ai/news/devstral-2-vibe-cli
      [2] https://www.anthropic.com/news/swe-bench-sonnet
    - menaerus 2 hours ago
      > 40k in consumer hardware is never going to compete with 40k of AI specialized GPUs/servers.
      For general purpose LLM probably yes. For something very domain-specialized not necessarily.
  - cmrdporcupine 11 hours ago
    With RAM prices spiking, there's no way consumers are going to have access to frontier quality models on local hardware any time soon, simply because they won't fit.
    That's not the same as discounting the open weight models though. I use DeepSeek 3.2 heavily, and was impressed by the Devstral launch recently. (I tried Kimi K2 and was less impressed). I don't use them for coding so much as for other purposes... but the key thing about them is that they're cheap on API providers. I put $15 into my deepseek platform account two months ago, use it all the time, and still have $8 left.
    I think the open weight models are 8 months behind the frontier models, and that's awesome. Especially when you consider you can fine tune them for a given problem domain...
- satvikpendem 17 hours ago
  > I'm curious what the mental calculus was that a $5k laptop would competitively benchmark against SOTA models for the next 5 years was.
  Well, the hardware remains the same but local models get better and more efficient, so I don't think there is much difference between paying 5k for online models over 5 years vs getting a laptop (and well, you'll need a laptop anyway, so why not just get a good enough one to run local models in the first place?).
  [-]
  - Workaccount2 10 hours ago
    Even if intelligence scaling stays equal, you'll lose out on speed. A sota model pumping 200 tk/s is going to be impossible to ignore with a 4 year old laptop choking itself at 3 tk/s.
    Even still, right now is when the first gen of pure LLM focused design chipsets are getting into data centers.
    [-]
    - lelanthran 4 hours ago
      > Even if intelligence scaling stays equal, you'll lose out on speed. A sota model pumping 200 tk/s is going to be impossible to ignore with a 4 year old laptop choking itself at 3 tk/s.
      Unless you're YOLOing it, you can review only at a certain speed, and for a certain number of hours a day.
      The only tokens/s you need is one that can keep you busy, and I expect that even a slow 5token/sec model utilised 60s in every minute, 60m of every hour and 24 hours of every day is way more than you can review in a single working day.
      The goal we should be moving towards is longer-running tasks, not quicker responses, because if I can schedule 30 tasks to my local LLm before bed, then wake up in the morning and schedule a different 30, and only then start reviewing, then I will spend the whole day just reviewing while the LLM is generating code for tomorrow's review. And for this workflow a local model running 5 tokens/s is sufficient.
      If you're working serially, i.e. ask the LLM to do something, then review what it gave you, then ask it to do the next thing, then sure, you need as many tokens per second as possible.
      Personally, I want to move to long-running tasks and not have to babysit the thing all day, checking in at 5m intervals.
    - satvikpendem 10 hours ago
      At a certain point, tokens per second stop mattering because the time to review stays constant. Whether it shits out 200 tokens a second versus 20, it doesn't much matter if you need to review the code that does come out.
  - brulard 15 hours ago
    If you have inference running on this new 128GB RAM Mac, wouldn't you still need another separate machine to do the manual work (like running IDE, browsers, toolchains, builders/bundlers etc.)? I can not imagine you will have any meaningful RAM available after LLM models are running.
    [-]
    - satvikpendem 10 hours ago
      No? First of all you can limit how much of the unified RAM goes into VRAM, and second, many applications don't need that much RAM. Even if you put 108 GB to VRAM and 16 to applications, you'll be fine.
      [-]
      - brulard 3 hours ago
        How about the rest of the resources? CPU/GPU? Would your work not be affected by inference running?
- thefourthchime 10 hours ago
  I completely agree. I can't even imagine using a local model when I can barely tolerate a model one tick behind SOTA for coding.
- littlestymaar 4 hours ago
  > Local models are purely for fun, hobby, and extreme privacy paranoia
  I always find it funny when the same people who were adamant that GPT-4 was game-changer level of intelligence are now dismissing local models that are both way more competent and much faster than GPT-4 was.
- ekianjo 12 hours ago
  That's the kind of attitude that removes power from the end user. If everything becomes SAAS you don't control anything anymore.
embedding-shape 38 minutes ago
> because GPT-OSS frequently gave me “I cannot fulfill this request” responses when I asked it to build features.
This is something that frequently comes up and whenever I ask people to share the full prompts, I'm never able to reproduce this locally. I'm running GPT-OSS-120B with the "native" weights in MXFP4, and I've only seen "I cannot fulfill this request" when I actually expect it, not even once had that happen for a "normal" request you expect to have a proper response for.
Has anyone else come across this when not using the lower quantizations or 20b (So GPT-OSS-120B proper in MXFP4) and could share the exact developer/system/user prompt that they used that triggered this issue?
Just like at launch, from my point of view, this seems to be a myth that keeps propagating, and no one can demonstrate a innocent prompt that actually triggers this issue on the weights OpenAI themselves published. But then the author here seems to actually have hit that issue but again, no examples of actual prompts, so still impossible to reproduce this issue.
bjt12345 6 hours ago
Here's my take on it though...
Just as we had the golden era of the internet in the late 90s, when the WWW was an eden of certificate-less homepages with spinning skulls on geocities without ad tracking, we are now in the golden era of agentic coding where massive companies make eye watering losses so we can use models without any concerns.
But this won't last and Local Llamas will become a compelling idea to use, particularly when there will be a big second hand market of GPUs from liquidated companies.
[-]
- sesm 1 hour ago
  Unfortunately, GPUs die in datacenters very quickly, and GPU manufacturers don't care about hardware longevity.
- aleggg 3 hours ago
  Yes. This heavily subsidized LLM inference usage will not last forever.
  We have already seen cost cutting for some models. A model starts strong, but over time the parent company switches to heavily quantized versions to save on compute costs.
  Companies are bleeding money, and eventually this will need to adjust, even for a behemoth like Google.
  That is why running local models is important.
- yread 3 hours ago
  Yep, when the tide goes away no company will be able to keep swimming naked offering stuff for free
raw_anon_1111 16 hours ago
I don’t think I’ve ever read an article where the reason I knew the author was completely wrong about all of their assumptions was that they admitted it themselves and left the bad assumptions in the article.
The above paragraph is meant to be a compliment.
But justifying it based on keeping his Mac for five years is crazy. At the rate things are moving, coding models are going to get so much better in a year, the gap is going to widen.
Also in the case of his father where he is working for a company that must use a self hosted model or any other company that needed it, would a $10K Mac Studio with 512GB RAM be worth it? What about two Mac Studios connected over Thunderbolt using the newly released support in macOS 26?
https://news.ycombinator.com/item?id=46248644
[-]
- baq 8 hours ago
  Yes, it’s worth it, if only because that Mac will be worth $20k in 3 months…
  [-]
  - john_minsk 6 hours ago
    Do you think prices will go up for mac?
    [-]
    - kergonath 3 hours ago
      That comment was a joke, but still. Resale prices for Macs are quite high. I didn’t run the calculation but it is entirely plausible the TCO including resale over a couple of years is much less than $200/month, if that’s the alternative.
    - phrotoma 5 hours ago
      baq's comment is a joke about RAM prices.
simonw 18 hours ago
This story talks about MLX and Ollama but doesn't mention LM Studio - https://lmstudio.ai/
LM Studio can run both MLX and GGUF models but does so from an Ollama style (but more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models
[-]
- ZeroCool2u 17 hours ago
  LMStudio is so much better than Ollama it's silly it's not more popular.
  [-]
  - thehamkercat 17 hours ago
    LMStudio is not open source though, ollama is
    but people should use llama.cpp instead
    [-]
    - smcleod 17 hours ago
      I suspect Ollama is at least partly moving away open source as they look to raise capitol, when they released their replacement desktop app they did so as closed source. You're absolutely right that people should be using llama.cpp - not only is it truly open source but it's significantly faster, has better model support, many more features, better maintained and the development community is far more active.
      [-]
      - calgoo 5 hours ago
        Only issue I have found with llama.cpp is trying to get it working with my amd GPU. Ollama almost works out of the box, in docker and directly on my Linux box.
      - parthsareen 10 hours ago
        Desktop app is open-source now.
    - nateb2022 15 hours ago
      > but people should use llama.cpp instead
      MLX is a lot more performant than Ollama and llama.cpp on Apple Silicon, comparing both peak memory usage + tok/s output.
      edit: LM Studio benefits from MLX optimizations when running MLX compatible models.
    - behnamoh 17 hours ago
      > LMStudio is not open source though, ollama is
      and why should that affect usage? it's not like ollama users fork the repo before installing it.
      [-]
      - thehamkercat 17 hours ago
        It was worth mentioning.
    - skhameneh 8 hours ago
      ik_llama is almost always faster when tuned. However, when untuned I've found them to be very similar in performance with varied results as to which will perform better.
      But vLLM and Sglang tend to be faster than both of those.
    - Abishek_Muthian 13 hours ago
      Besides optimizations specific to running locally lands in lamma.cpp first.
    - ekianjo 12 hours ago
      Ollama did not open source their GUI.
      [-]
      - jmorgan 11 hours ago
        The source is available here: https://github.com/ollama/ollama/tree/main/app
- midius 17 hours ago
  Makes me think it's a sponsored post.
  [-]
  - Cadwhisker 17 hours ago
    LMStudio? No, it's the easiest way to run am LLM locally that I've seen to the point where I've stopped looking at other alternatives.
    It's cross-platform (Win/Mac/Linux), detects the most appropriate GPU in your system and tells you whether the model you want to download will run within it's RAM footprint.
    It lets you set up a local server that you can access through API calls as if you were remotely connected to an online service.
    [-]
    - vunderba 17 hours ago
      FWIW, Ollama already does most of this:
      - Cross-platform
      - Sets up a local API server
      The tradeoff is a somewhat higher learning curve, since you need to manually browse the model library and choose the model/quantization that best fit your workflow and hardware. OTOH, it's also open-source unlike LMStudio which is proprietary.
      [-]
      - randallsquared 17 hours ago
        I assumed from the name that it only ran llama-derived models, rather than whatever is available at huggingface. Is that not the case?
        [-]
        fenykep 16 hours ago
        No, they have quite a broad list of models: https://ollama.com/search
        [edit] Oh and apparently you can also directly run some models directly from HuggingFace: https://huggingface.co/docs/hub/ollama
- thehamkercat 17 hours ago
  I think you should mention that LM Studio isn't open source.
  I mean, what's the point of using local models if you can't trust the app itself?
  [-]
  - rubymamis 7 hours ago
    You can always use something like Little Snitch to not allow it to dial home.
  - behnamoh 17 hours ago
    > I mean, what's the point of using local models if you can't trust the app itself?
    and you think ollama doesn't do telemetry/etc. just because it's open source?
    [-]
    - parthsareen 10 hours ago
      You're welcome to go through the source: https://github.com/ollama/ollama/
    - thehamkercat 17 hours ago
      That's why i suggested using llama.cpp in my other comment.
  - satvikpendem 17 hours ago
    Depends what people use them for, not every user of local models is doing so for privacy, some just don't like paying for online models.
    [-]
    - thehamkercat 17 hours ago
      Most LLM sites are now offering free plans, and they are usually better than what you can run locally, So I think people are running local models for privacy 99% of the time
- ekianjo 12 hours ago
  Lmstudio runs llama.cpp under the hood.
  [-]
  - selcuka 11 hours ago
    They also run the Apple MLX engine on macOS.
- evacchi 16 hours ago
  ramalama.ai is worth mentioning too
d4rkp4ttern 3 hours ago
I recently found myself wanting to use Claude Code and Codex-CLI with local LLMs on my MacBook Pro M1 Max 64GB. This setup can make sense for cost/privacy reasons and for non-coding tasks like writing, summarization, q/a with your private notes etc.
I found the instructions for this scattered all over the place so I put together this guide to using Claude-Code/Codex-CLI with Qwen3-30B-A3B, 80B-A3B, Nemotron-Nano and GPT-OSS spun up with Llama-server:
https://github.com/pchalasani/claude-code-tools/blob/main/do...
Llama.cpp recently started supporting Anthropic’s messages API for some models, which makes it really straightforward to use Claude Code with these LLMs, without having to resort to say Claude-Code-Router (an excellent library), by just setting the ANTHROPIC_BASE_URL.
NelsonMinar 17 hours ago
"This particular [80B] model is what I’m using with 128GB of RAM". The author then goes on to breezily suggest you try the 4B model instead of you only have 8GB of RAM. With no discussion of exactly what a hit in quality you'll be taking doing that.
[-]
- ethmarks 13 hours ago
  This is like if an article titled "A guide to growing your own food instead of buying produce" explained that the author was using a four-acre plot of farmland but suggested that that reader could also use a potted plant instead. Absolutely baffling.
cloudhead 18 hours ago
In my experience the latest models (Opus 4.5, GPT 5.2) Are _just_ starting to keep up with the problems I'm throwing at them, and I really wish they did a better job, so I think we're still 1-2 years away from local models not wasting developer time outside of CRUD web apps.
[-]
- OptionOfT 18 hours ago
  Eh, these things are trained on existing data. The further you are from that the worse the models get.
  I've noticed that I need to be a lot more specific in those cases, up to the point where being more specific is slowing me down, partially because I don't always know what the right thing is.
  [-]
  - cloudhead 16 hours ago
    For sure, and I guess that's kind of my point -- if the OP says local coding models are now good enough, then it's probably because he's using things that are towards the middle of the distribution.
    [-]
    - dkdcio 14 hours ago
      similar for me —- also how do you get the proper double dashes —- anyway, I’d love to be able to run CLI agents fully local, but I don’t see it being good enough (relative to what you can get for pretty cheap from SOTA models) anytime soon
      [-]
      - cloudhead 6 hours ago
        What’s wrong with your keyboard haha
        [-]
        dkdcio 4 hours ago
        iphone :/ I see others with the same problem too, oh well, at least people won’t accuse me of being an LLM probably
andix 17 hours ago
I wouldn't run local models on the development PC. Instead run them on a box in another room or another location. Less fan noise and it won't influence the performance of the pc you're working on.
Latency is not an issue at all for LLMs, even a few hundred ms won't matter.
It doesn't make a lot of sense to me, except when working offline while traveling.
[-]
- snoman 16 hours ago
  Less of a concern these days with hardware like a Mac Studio or Nvidia dgx which are accessible and aren’t noisy at all.
  [-]
  - andix 18 minutes ago
    I'm not fully convinced that those devices don't create noise at full power. But one issue still remains: LLMs eating up compute on the device you're working on. This will always be noticeable.
throw-12-16 6 hours ago
I never see devs containerize their coding agents.
It seems so obvious to me, but I guess people are happy with claude living in their home directory and slurping up secrets.
[-]
- onion2k 6 hours ago
  The devs I work with don't put secrets in their home directories. ;)
  [-]
  - throw-12-16 6 hours ago
    many many tools default to this, claude included
  - littlestymaar 6 hours ago
    And where are all their software putting their data then? Unless you consider only private keys to be secrets…
    (In particular the fact that Claude Code has access to your Anthropic API key is ironic given that Dario and Anthropic spend a lot of time fearmongering about how the AI could go rogue and “attempt to escape”).
SamDc73 16 hours ago
If privacy is your top priority, then sure spend a few grand on hardware and run everything locally.
Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription cause I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they’re usable, but clearly behind OpenAI, Claude, or Gemini for my workflow)
I was got excited about aescoder-4b (model that specialize in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind
That said, I think there’s real potential in small, highly specialized models like 4B model trained only for FastAPI, Tailwind or a single framework. Until that actually exists and works well, I’m sticking with remote services.
[-]
- eblanshey 15 hours ago
  What hardware can you buy for $5k to be able to run K2? That's a huge model.
  [-]
  - SamDc73 11 hours ago
    This older HN thread shows R1 running on a ~$2k box using ~512 GB of system RAM, no GPU, at ~3.5-4.25 TPS: https://news.ycombinator.com/item?id=42897205
    If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.
    [-]
    - nl 9 hours ago
      Is 4 TPS actually useful for anything?
      That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does and I'm using between 3.3M and 50M tokens in a day (plus additional usage in Claude + Codex + Mistral Vibe + Amp Coder)
      I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.
      [-]
      - zarzavat 7 hours ago
        3.5-50M tokens a day? What are you doing with all those tokens?
        Yesterday I asked Claude to write one function. I didn't ask it to do anything else because it wouldn't have been helpful.
        [-]
        KronisLV 1 hour ago
        Here’s my own stats, for comparison: https://news.ycombinator.com/item?id=46216192
        Essentially migrating codebases, implementing features, as well as all of the referencing of existing code and writing tests and various automation scripts that are needed to ensure that the code changes are okay. Over 95% of those tokens are reads, since often there’s a need for a lot of consistency and iteration.
        It works pretty well if you’re not limited by a tight budget.
    - BoredPositron 7 hours ago
      Stop recommending 3090s they are all but obsolete now. Not having native bf16 is a showstopper.
      [-]
      - qayxc 5 hours ago
        Hard disagree. The difference in performance is not something you'll notice if you actually use these cards. In AI benchmarks, the RTX 3090 beats the RTX 4080 SUPER, despite the latter having native BF16 support. 736GiB/s (4080) memory bandwidth vs 936 GiB/s (3090) plays a major role. Additionally, the 3090 is not only the last NVIDIA consumer card to support SLI.
        It's also unbeatable in price to performance as the next best 24GiB card would be the 4090 which, even used, is almost tripple the price these days while only offering about 25%-30% more performance in real-world AI workloads.
        You can basically get an SLI-linked dual 3090 setup for less money than a single used 4090 and get about the same or even more performance and double the available VRAM.
        [-]
        BoredPositron 3 hours ago
        If you run fp32 maybe but no sane person does that. The tensor performance of the 3090 is also abysmal. If you run bf16 or fp8 stay away from obsolete cards. Its barely usable for llms and borderline garbage tier on video and image gen.
        [-]
        qayxc 1 hour ago
        Actual benchmarks show otherwise.
        > The tensor performance of the 3090 is also abysmal.
        I for one compared my 50-series card's performance to my 3090 and didn't see "abysmal performance" on the older card at all. In fact, in actual real-world use (quantised models only, no one runs big fp32 models locally), the difference in performance isn't very noticeable at all. But I'm sure you'll be able to provide actual numbers (TTFT, TPS) to prove me wrong. I don't use diffusion models, so there might be a substantial difference there (I doubt it, though), but for LLMs I can tell you for a fact that you're just wrong.
        [-]
        BoredPositron 3 minutes ago
        I don't use consumer cards. Benchmarks are out there (phoronix, runpod or from Nvidias own presentation) and they say it's at least 2x on high and nearly 4x on low precision, which is comparable to the uplift I see on my 6000 cards, if you don't see the performance uplift everyone else sees there is something wrong with your setup and I don't have the time to debug it.
nzeid 18 hours ago
I appreciate the author's modesty but the flip-flopping was a little confusing. If I'm not mistaken, the conclusion is that by "self-hosting" you save money in all cases, but you cripple performance in scenarios where you need to squeeze out the kind of quality that requires hardware that's impractical to cobble together at home or within a laptop.
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
[-]
- a_victorp 18 hours ago
  If you ever do it, please make a guide! I've been toying with the same notion myself
  [-]
  - whitehexagon 46 minutes ago
    SimonW used to have more articles/guides on local LLM setup, at least until he got the big toys to play with, but well worth looking through his site. Although if you are in parts of Europe, the site is blocked at weekends, something to do with the great-firewall of streamed sports.
    https://simonwillison.net/
    Indeed, his self hosting inspired me to get Qwen3:32B ollama working locally. Fits nicely on my M1 pro 32GB (running Asahi). Output is a nice read-along speed and I havent felt the need for anything more powerful.
    I'd be more tempted with a maxed out M2 Ultra as an upgrade, vs tower with dedicated GPU cards. The unified memory just feels right for this task. Although I noticed the 2nd hand value of those machine jumped massively in the last few months.
    I know that people turn their noses up at local LLM's, but it more than does the job for me. Plus I decided a New Years Resolution of no more subscriptions / Big-AdTech freebies.
  - suprjami 17 hours ago
    If you want to do it cheap, get a desktop motherboard with two PCIe slots and two GPUs.
    Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context. 1000 lines of code is ~20k tokens. 32k tokens is is ~10G VRAM.
    Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.
    For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
    If you need more than that, you're into enterprise hardware with 4+ PCIe slots which costs as much as a car and the power consumption of a small country. You're better to just pay for Claude Code.
    [-]
    - le-mark 15 hours ago
      I was going to post snark such as “you could use the same hardware to also lose money mining crypto” then realized there are a lot of crypto miners out their that could probably make more money running tokens then they do on crypto. Does such a market place exist?
      [-]
      - hackstack 15 hours ago
        This is essentially vast.ai, no?
        [-]
        MrDrMcCoy 10 hours ago
        A quick glance at their homepage says they run in "secure datacenters", so no.
        [-]
        gkbrk 6 hours ago
        Then you glanced too quickly, vast.ai absolutely has non-datacenter GPUs.
        https://vast.ai/hosting#gpu-farms-homelabs
  - satvikpendem 17 hours ago
    Jeff Geerling has (not quite but sort of) guides: https://news.ycombinator.com/item?id=46338016
maranas 17 hours ago
Cline + RooCode and VSCode already works really well with local models like qwen3-coder or even the latest gpt-oss. It is not as plug-and-play as Claude but it gets you to a point where you only have to do the last 5% of the work
[-]
- rynn 15 hours ago
  What are you working on that you’ve had such great success with gpt-oss?
  I didn’t try it long because I got frustrated waiting for it to spit out wrong answers.
  But I’m open to trying again.
  [-]
  - embedding-shape 34 minutes ago
    > What are you working on that you’ve had such great success with gpt-oss?
    I'm doing programming on/off (mostly use Codex with hosted models) with GPT-OSS-120B, and with reasoning_effort set to high, it gets it right maybe 95% of the times, rarely does it get anything wrong.
  - maranas 15 hours ago
    I use it to build some side-projects, mostly apps for mobile devices. It is really good with Swift for some reason.
    I also use it to start off MVP projects that involve both frontend and API development but you have to be super verbose, unlike when using Claude. The context window is also small, so you need to know how to break it up in parts that you can put together on your own
ineedasername 15 hours ago
I’ve been using Qwen3 Coder 30b quantized down to IQ3_XSS to fit in < 16gb vram. Blazing fast 200+ tokens per second on a 4080. I don’t ask anything complicated, but one off scripts to do something I’d normally have to do manually by hand or take an hour to write the script myself? Absolutely.
These are no more than a few dozen lines I can easily eyeball and verify with confidence- that’s done in under 60 seconds and leaves Claude code with plenty of quota for significant tasks.
Roark66 4 hours ago
I found the winning combination is to use all of them in this way: - first you need a vendor agnostic tool like opencode (I had to add my own vendors as it didn't support it out of the box properly) - second you set up agents with different models. I use: - for architecture and planning - opus, Sonet, gpt 5.2, gemini3 (depending on specifics, for example I found got better in troubleshooting, Sonet better in pure code planning, opus better in DevOps, Gemini the best for single shot stuff) - for execution of said plans (Qwen 2.5 Coder 30B - yes, it's even better in my use cases than Qwen3 despite benchmarks, Sonet - only when absolutely necessary, Qwen3-235B - between Qwen 2.5 and Sonet) - verification (Gemini 3 flash, Qwen3-480B etc)
The biggest saving you make is by making the context smaller and where many turns are required going for smaller models. For example a single 30min troubleshooting session with Gemini 3 can cost $15 if you run it "normally" or it can cost $2 if you use the agents, wipe context after most turns (can be done thanks to tracking progress in a plan file)
amarant 10 hours ago
Buying a maxed out MacBook Pro seems like the most expensive way to go about getting the necessary compute. Apple is notorious for overcharging for hardware, especially on ram.
I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.
Getting a maxed out non-apple laptop will also be cheaper for comparable hardware, if portability is important to you.
[-]
- kube-system 9 hours ago
  You need memory hooked up to the GPU. Apple’s unified memory is actually one of the cheaper ways to do this. On a typical x86-64 desktop, this means VRAM… for 100+ GB of VRAM you’re deep into tens of thousand of dollars.
  Also, if you think Apple’s RAM prices are crazy… you might be surprised at what current DDR5 pricing is today. The $800 that Apple charges to upgrade a MBP from 64-128GB is the current price of 64GB desktop DDR5-6000. Which is actually slower memory than the 8533 MT/s memory you’re getting in the MacBook.
- nl 9 hours ago
  You want unified RAM.
  On Linux your options are the NVidia Spark (and other vendor versions) or the AMD Ryzen AI series.
  These are good options, but there are significant trade-offs. I don't think there are Ryzen AI laptops with 128GB RAM for example, and they are pricey compared to traditional PCs.
  You also have limited upgradeability anyway - the RAM is soldered.
- Renaud 10 hours ago
  Can any x86 based system actually comes with that much unified memory?
  Not an Apple fanboy, but I was under the impression that having access to up to 512GB usable GPU memory was the main feature in favour of the mac.
  And now with Exo, you can even break the 512GB barrier.
fny 14 hours ago
My takeaway is that clock is ticking on Claude, Codex et al's AI monopoly. If a local setup can do 90% of what Claude can do today, what do things look like in 5 years?
[-]
- maranas 14 hours ago
  I think they have already realized this, which is why they are moving towards tool use instead of text generation. Also explains why there are no more free APIs nowadays (even for search)
- ukuina 14 hours ago
  Exactly, imagine what Claude can do in five years!
  [-]
  - rester324 6 hours ago
    10% on top of what we have now and the same things that the local models can do of those times ahead of us?
mungoman2 10 hours ago
The money argument is IMHO not super strong, here as that Mac depreciates more per month than the subscription they want to avoid.
There may be other reasons to go local, but I would say that the proposed way is not cost effective.
There's also a fairly large risk that this HW may be sufficient now, but will be too small in not too long. So there is a large financial risk built into this approach.
The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we use less capable tools even the $20/mo subscriptions won't hit their limit.
altx 10 hours ago
Its interesting to notice that here https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com... we default to measuring LLM coding performance as how long[~5h] a human task a model can complete with 50% success-rate (with 80% fall back for the second chart [~.5h]), while here it seems that for actual coding we really care about the last 90-100% of the costly model's performance.
jszymborski 11 hours ago
I just got a RTX 5090, so I thought I'd see what all the fuss was about these AI coding tools. I've previously copy pasted back and forth from Claude but never used the instruct models.
So I fired up Cline with gpt-oss-120b, asked it to tell me what a specific function does, and proceeded to watch it run `cat README.md` over and over again.
I'm sure it's better with other the Qwen Coder models, but it was a pretty funny first look.
[-]
- kelvie 10 hours ago
  gpt-oss-120b doesn't fit on a 5090 without offloading or crazy quants -- or did you mean you ran it via openrouter or something?
  [-]
  - jszymborski 51 minutes ago
    I'm running the MXFP4 [0] quants at like 10-13 toks/sec. It is actually really good, I'm starting to think its a problem with Cline since I just tried it with Qwen3 and the same thing happened. Turns out Cline _hates_ empty files in my projects, although they aren't required for this to happen.
    [0] https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...
  - kube-system 8 hours ago
    Sounds like a crazy quant. IME 2 bit quants are pretty dumb.
ardme 17 hours ago
Isnt the math of buying Nvidia stock with what you pay for all the hardware and then just paying $20 a month for codex with the annual returns better?
[-]
- phainopepla2 15 hours ago
  If you can see into the future and know the stock price, then sure.
  [-]
  - Muromec 14 hours ago
    The line only ever goes up, until we all cry and find a new false messiah. Or die
threethirtytwo 16 hours ago
I hope hardware becomes so cheap local models become the standard.
[-]
- rynn 16 hours ago
  It will be like the rest of computing, some things will move to the edge and others stay on the cloud.
  Best choice will depend on use cases.
  [-]
  - lelanthran 3 hours ago
    > It will be like the rest of computing, some things will move to the edge and others stay on the cloud.
    It will become like cloud computing - some people will have a cloud bill of $10k/m to host their apps, other people would run their app on a $15/m VPS.
    Yes, the cost discrepancy will be as big as the current one we see in cloud services.
  - Terr_ 10 hours ago
    I think the long term will depends on the legal/rent-seeking side.
    Imagine having the hardware capacity to run things locally, but not the necessary compliance infrastructure to ensure that you aren't committing a felony under the Copyright Technofeudalism Act of 2030.
Simplita 4 hours ago
One thing that surprised us when testing local models was how much easier debugging became once we treated them as decision helpers instead of execution engines. Keeping the execution path deterministic avoided a lot of silent failures. Curious how others are handling that boundary.
NumberCruncher 9 hours ago
I am freelancing on the side and charge 100€ by the hour. Spending roughly 100€ per month on AI subscriptions has a higher ROI for me personally than spending time on reading this article and this thread. Sometimes we forget that time is money...
brainless 11 hours ago
I do not spend $100/month. I spend for 1 Claude Pro subscription and then a (much cheaper) z.ai Coding Plan, which is like one fifth the cost.
I use Claude for all my planning, create task documents and hand over to GLM 4.6. It has been my workhorse as a bootstrapped founder (building nocodo, think Lovable for AI agents).
[-]
- alok-g 9 hours ago
  I have heard about this approach elsewhere too. Could you please provide some more details on the set up steps and usage approach. I would like to replicate. Thanks.
  [-]
  - baconner 9 hours ago
    There are a couple of decent approaches to having a planning/reviewer model set (eg. claude, codex, gemini) and an execution model (eg. glm 4.6, flash models, etc) workflow that I've tried. All three of these will let you live in a single coding cli but swap in different models for different tasks easily.
    - claude code router - basically allows you to swap in other models using the real claude code cli and set up some triggers for when to use which one (eg. plan mode use real claude, non plan or with keywords use glm)
    - opencode - this is what im mostly using now. similar to ccr but i find it a lot more reliable against alt models. thinking tasks go to claude, gemini, codex and lesser execution tasks go to glm 4.6 (on ceberas).
    - sub-agent mcp - Another cool way is to use an mcp (or a skill or custom /command) that runs another agent cli for certain tasks. The mcp approach is neat because then your thinker agent like claude can decide when to call the execution agents, when to call in another smart model for a review of it's own thinking, etc instead of it being explicit choice from you. So you end up with the mcp + an AGENTS.md that instructs it to aggressively use the sub-agent mcp when it's a basic execution task, review, ...
    I also find that with this setup just being able to tap in an alt model when one is stuck, or get review from an alt model can help keep things unstuck and moving.
    [-]
    - KronisLV 1 hour ago
      RooCode and KiloCode also have an Orchestrator mode that can create sub-tasks and you can specify which model to use for what - and since they report their results back after finishing a task (implement X, fix Y), the context of the more expensive model doesn’t get as polluted. Probably one of the most user friendly ways to do that.
      A simpler approach without subtasks would be to just use the smart model for Ask/Plan/whatever mode and the dumb but cheap one for the Code one, so the smart model can review the results as well and suggest improvements or fixes.
  - brainless 5 hours ago
    I simply ask Claude Sonnet, using claudecode, to use opencode. That's it! Example:
```
  We need to clean up code lint and format errors across multiple files. Check which files are affected using cargo commands. Please use opencode, a coding agent that is installed. Use `opencode run <prompt>` to pass in a per-file prompt to opencode, wait for it to finish, check and ask again if needed, then move to next file. Do not work on files yourself.
```
stuaxo 5 hours ago
Is the conclusion the same if you have a computer that is just for the LLM, and a separate one that runs your dev tools ?
KronisLV 5 hours ago
My experience: even for the run of the mill stuff, local models are often insufficient, and where they would be sufficient, there is a lack of viable software.
For example, simple tasks CAN be handled by Devstral 24B or Qwen3 30B A3B, but often they fail at tool use (especially quantized versions) and you often find yourself wanting something bigger, where the speed falls a bunch. Even something like zAI GLM 4.6 (through Cerebras, as an example of a bigger cloud model) is not good enough for doing certain kinds of refactoring or writing certain kinds of scripts.
So either you use local smaller models that are hit or miss, or you need a LOT of expensive hardware locally, or you just pay for Claude Code, or OpenAI Codex, or Google Gemini, or something like that. Even Cerebras Code that gives me a lot of tokens per day isn't enough for all tasks, so you most likely will need a mix - but running stuff locally can sometimes decrease the costs.
For autocomplete, the one thing where local models would be a nearly perfect fit, there just isn't good software: Continue.dev autocomplete sucks and is buggy (Ollama), there don't seem to be good enough VSC plugins to replace Copilot (e.g. with those smart edits, when you change one thing in a file but have similar changes needed like 10, 25 and 50 lines down) and many aren't even trying - KiloCode had some vendor locked garbage with no Ollama support, Cline and RooCode aren't even trying to support autocomplete.
And not every model out there (like Qwen3) supports FIM properly, so for a bit I had to use Qwen2.5 Coder, meh. Then when you have some plugins coming out, they're all pretty new and you also don't know what supply chain risks you're dealing with. It's the one use case where they could be good, but... they just aren't.
For all of the billions going into AI, someone should have paid a team of devs to create something that is both open (any provider) and doesn't fucking suck. Ollama is cool for the ease of use. Cline/RooCode/KiloCode are cool for chat and agentic development. OpenCode is a bit hit or miss in my experience (copied lines getting pasted individually), but I appreciate the thought. The rest is lacking.
2001zhaozhao 10 hours ago
Under current prices buying hardware just to run local models is not worth it EVER, unless you already need the hardware for other reasons or you somehow value having no one else be able to possibly see your AI usage.
Let's be generous and assume you are able to get a RTX 5090 at MSRP ($2000) and ignore the rest of your hardware, then run a model that is the optimal size for the GPU. A 5090 has one of the best throughputs in AI inference for the price, which benefits the local AI cost-efficiency in our calculations. According to this reddit post it outputs Qwen2.5-Coder 32B at 30.6 tokens/s. https://www.reddit.com/r/LocalLLaMA/comments/1ir3rsl/inferen...
It's probably quantized, but let's again be generous and assume it's not quantized any more than models on OpenRouter. Also we assume you are able to keep this GPU busy with useful work 24/7 and ignore your electricity bill. At 30.6 tokens/s you're able to generate 993M output tokens in a year, which we can conveniently round up to a billion.
Currently the cheapest Qwen2.5-Coder 32B provider on OpenRouter that doesn't train on your input runs it at $0.06/M input and $0.15/M output tokens. So it would cost $150 to serve 1B tokens via API. Let's assume input costs are similar since providers have an incentive to price both input and output proportionately to cost, so $300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly.
Conclusion: even with EVERY assumption in favor of the local GPU user, it still takes almost 7 years for running a local LLM to become worth it. (This doesn't take into account that API prices will most likely decrease over time, but also doesn't take into account that you can sell your GPU after the breakeven period. I think these two effects should mostly cancel out.)
In the real world in OP's case, you aren't running your model 24/7 on your MacBook; it's quantized and less accurate than the one on OpenRouter; a MacBook costs more and runs AI models a lot slower than a 5090; and you do need to pay electricity bills. If you only change one assumption and run the model only 1.5 hours a day instead of 24/7, then the breakeven period already goes up to more than 100 years instead of 7 years.
Basically, unless you absolutely NEED a laptop this expensive for other reasons, don't ever do this.
[-]
- rester324 6 hours ago
  These are the comments of the people who will cry a f@cking river when all the f@cking bubbles burst. You really think that it's "$300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly"??? Maybe you forgot to read the news how much fucking money these companies are burning and losing each year. So these kind of comments as "to run local models is not worth it EVER" make me chuckle. Thanks for that!
Myrmornis 11 hours ago
Can anyone give any tips for getting something that runs fairly fast under ollama? It doesn't have to be very intelligent.
When I tried gpt-oss and qwen using ollama on an M2 Mac the main problem was that they were extremely slow. But I did have a need for a free local model.
[-]
- parthsareen 10 hours ago
  How much ram are you running with? Qwen3 and gpt-oss:20b punch a good bit above their weight. Personally use it for small agents.
- am17an 10 hours ago
  Use llama.cpp? I get 250 toks/sec on gpt-oss using a 4090, not sure about the mac speeds
mungoman2 10 hours ago
The money argument doesn't make sense here as that Mac depreciates more per month than the subscription they want to avoid.
There may be other reasons to go local, but the proposed way is not cost effective.
dfischer96 6 hours ago
Nice guide! I want to point out opencode CLI, which is far superior to Qwen CLI in my opinion.
flowinghorse 7 hours ago
Local models less than 2b are good enough for code auto completion. Even you don't have 128G memory.
bearjaws 2 hours ago
I am sorry but anyone who actually has tried this knows it is horrifically slow, significantly slower than you just typing for any model worth its weight.
That 128gb of RAM is nice but the time to first token is so long on any context over 32k, and the results are not even close to a Codex or Sonnet.
jollymonATX 15 hours ago
This is not really a guide to local coding models which is kinda disappointing. Would have been interested in a review of all the cutting edge open weight models in various applications.
ikidd 9 hours ago
So I can't see bothering with this when I pumped 260M tokens through running in Auto mode on a $20/mo Cursor plan. It was my first month of a paid subscription, if that means anything. Maybe someone can explain how this works for them?
Frankly, I don't understand it at all, and I'm waiting for the other shoe to drop.
[-]
- lelanthran 3 hours ago
  > So I can't see bothering with this when I pumped 260M tokens through running in Auto mode on a $20/mo Cursor plan. It was my first month of a paid subscription, if that means anything. Maybe someone can explain how this works for them?
  They're running at a loss and covering up the losses using VC?
  > Frankly, I don't understand it at all, and I'm waiting for the other shoe to drop.
  I think that the providers are going to wait until there are a significant number of users that simply cannot function in any way without the subscription, and then jack up the prices.
  After all, I can all but guarantee that even the senior devs at most places now won't be able to function if every single tool or IDE provided by a corporate (like VSCode) was yanked from them.
  Myself, you can scrub my main dev desktop of every corporate offering, and I might not even notice (emacs or neovim, plugins like Slime, Lsp plugins, etc) is what I am using daily, along with programming languages.
tempodox 8 hours ago
> You might need to install Node Package Manager for this.
How anyone in this day and age can still recommend this is beyond me.
freeone3000 17 hours ago
What are you doing with these models that you’re going above free tier on copilot?
[-]
- satvikpendem 17 hours ago
  Some just like privacy and working without internet, I for example travel regularly on the train and like to have my laptop when there's not always good WiFi.
BoredPositron 15 hours ago
Not worth it yet. I run a 6000 black for image and video generation, but local coding models just aren't on the same level as the closed ones.
I grabbed Gemini for $10/month during Black Friday, GPT for $15, and Claude for $20. Comes out to $45 total, and I never hit the limits since I toggle between the different models. Plus it has the benefit of not dumping too much money into one provider or hyper focusing on one model.
That said, as soon as an open weight model gets to the level of the closed ones we have now, I'll switch to local inference in a heartbeat.
avhception 7 hours ago
I tried local models for general-purpose LLM tasks on my Radeon 7800 XT (20GB VRAM), and was disappointed.
But I keep thinking: It should be possible to run some kind of supercharged tab completion on there, no? I'm spending most of my time writing Ansible or in the shell, and I have a feeling that even a small local model should give me vastly more useful completion options...
dackdel 12 hours ago
no one using exo?
[-]
- redrove 9 hours ago
  https://github.com/exo-explore/exo
  I keep hearing about it but unfortunately I myself only have one mac and nvidia GPUs and those can’t cluster together :/
holyknight 17 hours ago
your premise would've been right, if memory wouldn't skyrocketed like 400% in like 2 weeks.
Bukhmanizer 15 hours ago
Are people really so naive to think that the price/quality of proprietary models is going to stay the same forever? I would guess sometime in the next 2-3 years all of the major AI companies are going to increase the price/enshittify their models to the point where running local models is really going to be worth it.
dhruv3006 7 hours ago
r/locallama has very good discussion for this!
[-]
- bjt12345 6 hours ago
  /r/localllama is the spelling, I'm forever making this same mistake.
m3kw9 9 hours ago
Nobody doing serious coding will use local models when frontier models are that much better, and no they are not half a gen behind frontier. More like 2 gen.
artursapek 12 hours ago
Imagine buying hardware that will be obsolete in 2 years instead of paying Anthropic $200 for $1000+ worth of tokens per month
[-]
- selcuka 11 hours ago
  > Imagine buying hardware that will be obsolete in 2 years
  Unless the PC you buy is more than $4,800 (24 x $200) it is still a good deal. For reference, a MacBook M4 Max with 128GB of unified RAM is $4,699. You need a computer for development anyway, so the extra you pay for inference is more like $2-3K.
  Besides, it will still run the same model(s) at the same speed after that period, or even maybe faster with future optimisations in inference.
  [-]
  - hu3 9 hours ago
    The value depreciation of the hardware alone is going to be significant. Probably enough to pay for 3x ~$20 subscriptions to OpenAI, Anthropic and Gemini.
    Also, if you use the same mac to work, you can't reserve all 128GB for LLMs.
    Not to mention a mac will never run SOTA models like Opus 4.5 or Gemini 3.0 which subscriptions gives you.
    So unless you're ready to sacrifice quality and speed for privacy, it looks like a suboptimal arrangement to me.
    [-]
    - artursapek 11 minutes ago
      Yeah, didn't even mention the fact that you can't Opus on your own hardware. Total waste of cash.
    - dchftcs 5 hours ago
      I suspect depreciation will be a bit slower for a while, because there is a supply crunch.
bubbi 3 hours ago
[dead]
chrisischris 16 hours ago
[dead]
[-]
- hmokiguess 14 hours ago
  This seems really interesting. Reminds of IPFS but for AI
h0rmelchilly 14 hours ago
[dead]