Instead of using AWS, another approach is to self-host the hardware as well. Even after factoring in energy, this dramatically lowers the price.
Assuming we want to mirror our setup in AWS, we'd need 4x Nvidia Tesla T4s. You can buy them for about $700 each on eBay.
Add in $1,000 to set up the rest of the rig and you have a final price of around:
(4 × $700) + $1,000 = $2,800 + $1,000 = $3,800
This whole exercise assumes that you're using the Llama 3 8b model. At full fp16 precision that will fit in a single 3090 or 4090 GPU (the int8 version will too, and it runs faster with very little degradation). Especially if you're willing to buy GPU hardware on eBay, that will cost significantly less than the T4 rig.
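The sizing intuition, roughly (weights only, ignoring KV cache and activation overhead):

```
8B params × 2 bytes (fp16) ≈ 16GB → fits a 24GB 3090/4090
8B params × 1 byte  (int8) ≈  8GB → plenty of headroom
```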
I have my home workstation with a 4090 exposed as a vLLM service to an AWS environment, where I access it via a reverse SSH tunnel.
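For anyone curious, a minimal sketch of that kind of tunnel, assuming vLLM's OpenAI-compatible server on port 8000 and a placeholder AWS hostname:

```bash
# On the home workstation: start the vLLM server (model and port are examples).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000 &

# Reverse tunnel: -R binds port 8000 on the AWS box and forwards it back here;
# -N opens no remote shell, just the tunnel.
ssh -N -R 8000:localhost:8000 ubuntu@aws-host.example.com

# On the AWS side, the GPU now looks like a local service:
curl http://localhost:8000/v1/models
```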
Why did this only occur to me recently? You can self-host a k8s cluster and expose the services using a $5 DigitalOcean droplet. The droplet and the k8s services are point-to-point connected using Tailscale. Performance is perfectly fine, it keeps your skillset sharp, and you're self-hosting!
You can also just directly connect to containers using Tailscale if it's just for internal use. That is, having an internally addressable `https://container_name` on your tailnet per container if you want. This way I can set up Immich, for example, and it's just on my tailnet at `https://immich` without the need for a reverse proxy, etc.
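One way to get that per-container hostname is a Tailscale sidecar; a rough sketch, where the auth key, volume, and image tags are placeholders:

```bash
# Tailscale sidecar joins the tailnet as the node "immich".
docker run -d --name immich-ts \
  -e TS_AUTHKEY=tskey-auth-XXXX \
  -e TS_HOSTNAME=immich \
  -e TS_STATE_DIR=/var/lib/tailscale \
  -v tailscale-state:/var/lib/tailscale \
  tailscale/tailscale

# The app shares the sidecar's network namespace, so it answers on that node.
docker run -d --name immich --network container:immich-ts \
  ghcr.io/immich-app/immich-server
# With MagicDNS (and HTTPS certs) enabled, https://immich resolves on the tailnet.
```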
Whoa, so you have code running in AWS making use of your local hardware via what is called a reverse SSH tunnel? I will have to look into how that works; that's pretty powerful if so. I have a mac mini that I use for builds and deploys via FTP/SFTP, and I was going to look into setting up "messaging" via that pipeline to access local hardware compute through file messages lol. But a reverse SSH tunnel sounds like it'll be way better for directly calling executables, rather than needing to parse messages from files first.
Look into Nebula (or Tailscale if you trust third parties). I have all my workstations and servers on a mesh network that appears as a single /24, is end-to-end encrypted and mutually authenticated, and works through/behind NAT. I can spawn a vhost on any server that reverse-proxies an API to any port on any machine.
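A sketch of what that vhost trick can look like, assuming nginx and an example overlay IP from a 192.168.100.0/24 mesh subnet (the hostname, cert paths, and addresses are placeholders):

```bash
cat > /etc/nginx/conf.d/api.example.com.conf <<'EOF'
server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate     /etc/nginx/certs/api.crt;
    ssl_certificate_key /etc/nginx/certs/api.key;
    location / {
        # Proxy to a port on another machine, carried over the encrypted mesh.
        proxy_pass http://192.168.100.12:8080;
    }
}
EOF
nginx -s reload
```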
Well, one example, depending on your threat model: their privacy policy states that they retain info and comply with subpoenas.
There's also potential for malicious updates to compromise a network (as there is with most software unless you're auditing the source for each update).
E2EE is only as meaningful as where the keys reside and how easily those keys can be abused.
The idea of “user’s permission” is determined by Tailscale and/or the OIDC provider. I don’t know anything about “tailnet lock”; perhaps it is a new mitigation for this issue?
I didn’t start with Tailscale because the only way you could log into it was with Google or GitHub or something. I don’t trust Microsoft or Google with auth for my internal network. I thought about running Headscale, but Nebula was faster/easier for me.
The docs are good. When creating the initial CA, make absolutely sure you set the CA expiration to 10-30 years; the default is 1 year, which means your whole setup explodes in a year without warning.
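Concretely, something like this when minting the CA (the name and IP are placeholders; check `nebula-cert ca -h` for the exact flags in your version):

```bash
# 175200h ≈ 20 years; without -duration the CA expires in 1 year.
nebula-cert ca -name "my-mesh" -duration 175200h

# Host certs are then signed against that CA, e.g.:
nebula-cert sign -name "workstation" -ip "192.168.100.12/24"
```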
I feel like WireGuard definitely does NAT punching, unless I misunderstand you. I've been doing this sort of thing to have my phone and desktop on the same "LAN" all the time so I can Moonlight in from anywhere (among other things), and they're definitely NATted.
Unfortunately my mac mini isn't beefy enough to run Ollama; it's the base-model M1 from a couple years ago lol. But it's very capable for builds, deploys, and some computation via scripts. Now I'm curious to check how much memory the newest ones support, for potentially running Ollama on one haha. Thanks!
(It may need to be Q4 or Q3 instead of Q5 depending on how the RAM shakes out. But the Q5_K_M quantization (k-quantization is the term) is generally the best balance of size vs. performance vs. intelligence if you can run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of course even better, but you're nowhere near fitting that in 8GB.)
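For reference, picking a quantization in Ollama is just a matter of the model tag; a sketch (the exact tag names may differ, check the Ollama library):

```bash
ollama pull llama3:8b-instruct-q5_K_M   # best balance if it fits in RAM
ollama pull llama3:8b-instruct-q4_K_M   # fallback when memory is tight
ollama run llama3:8b-instruct-q4_K_M "Explain k-quantization in one sentence."
```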
Dolphin-llama3 is generally more compliant, and I'd recommend it over the base model. It's been fine-tuned to drop the dumb "sorry, I can't do that" battle, and it turns out this also increases the quality of the results (refusal training limits the space you're generating in, and limiting that space also limits the quality of the results).
Most of the time you will want to look for an "instruct" model; if it doesn't have the instruct suffix it'll normally be a "fill in the blank" model that continues what it thinks is the pattern in the input, rather than generating a textual answer to a question. But Ollama typically pulls the instruct models into its repos.
(Sometimes you will see this even with instruct models, especially if they're misconfigured. When the non-dolphin llama3 first came out, I played with it and I'd get answers that looked like Stack Overflow- or Quora-format responses, complete with "scores" etc., either as the full output or mixed in. Presumably a misconfigured model, or they pulled in a non-instruct model, or something.)
Dolphin-mixtral:8x7b-v2.7 is where things get really interesting imo. I have 64GB and 32GB machines, and so far the Q6 and Q4_K_M quants are the best options for those machines respectively. dolphin-llama3 is reasonable, but dolphin-mixtral gives richer, better responses.
I'm told there's better stuff available now, but I'm not sure what a good choice would be for 64GB and 32GB if not Mixtral.
Also, just keep an eye on r/LocalLLaMA in general; that's where all the enthusiasts hang out.
Using Tailscale can make the networking setup much easier; I really like their service for things like this (or for curling another dev's locally running server).
I don't know enough about networking and that level of configuration to do it confidently and safely yet. I'd rather rely on credentials where the only real access is a few limited command-line executables or file transfer, rather than exposing more of the hardware directly to the network for a direct connection. I do have an interest in learning this much, but I find that my current FTP/SFTP approach has more guardrails than a direct connection. Do you agree with this, or am I just not understanding enough about IPv6 and direct connections home?
You get the equivalent setup in IPv6 by having your home modem or router deny inbound connections.
I think port forwarding configuration is a pain that does not offer value over just poking a hole in your firewall for an authenticated connection over SSH.
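In firewall terms that's just default-deny inbound plus one hole; a minimal sketch with ip6tables (illustrative, not a hardened config):

```bash
ip6tables -P INPUT DROP                                                  # default-deny inbound
ip6tables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT  # allow return traffic
ip6tables -A INPUT -p tcp --dport 22 -j ACCEPT                           # the SSH hole (key auth only)
```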
Read the other comment and also immediately thought of Cloudflare Tunnel instead. Is there a reason you chose that? Wondering if I should do the same with my old Titan XP (probably slower than your 3070, but it does have 12GB of VRAM).
No reason other than that Cloudflare is quite a nice reverse proxy and I already use it for managing my DNS records. Thus far I've not noticed any issues with it, other than when my home wifi goes down.
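For anyone wanting to try the Cloudflare Tunnel route, it's roughly this (the tunnel name and hostname are placeholders; a config file is the more permanent setup):

```bash
cloudflared tunnel login                                  # authorize against your CF account
cloudflared tunnel create llm                             # create a named tunnel
cloudflared tunnel route dns llm llm.example.com          # point a hostname at it
cloudflared tunnel run --url http://localhost:8000 llm    # front the local service
```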
1B tokens on Gemini Flash (which in my experience is on par with llama3-70b, or sometimes even better) at a 2:1 input:output split would cost ~$600 (ignoring the fact that they offer 1M tokens a day for free now). Ignoring electricity, you'd break even in >8 years. You can find llama3-70b hosted at ~the same prices if you're interested in that specific model.
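Back-of-envelope, assuming the ~$5k rig cost cited elsewhere in the thread and ~1B tokens of usage per year (both assumptions, not the parent's exact inputs):

```
$5,000 ÷ ($600 per 1B tokens) ≈ 8.3B tokens
at ~1B tokens/year, that's >8 years to break even on the hardware
```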
I answered the financial side in another reply, but another factor is that I need to know the model today is exactly the same as the model tomorrow for reliable scientific benchmarking.
I need to tell if a change I made was impactful, but if the model just magically gets smarter or dumber at my tasks with no warning, then I can't tell if I made an improvement or a regression.
Whereas the model on my GPU doesn't change unless I change it. So it's one less variable, and LLMs are black boxes to start with.
I may be wrong about Gemini, but my impression is that all the companies are constantly tweaking the big models. I know GPT on Monday is not always the same GPT on Thursday, for example.
I've worked professionally over the last 12 months hosting quite a few foundation models and fine-tuned LLMs on our own hardware, AWS + Azure VMs, and also a variety of the newer "inference serving" type services that are popping up everywhere.
I don't do any work with the output; I'm just the MLOps guy (ahem, DevOps).
You mention expense, but on a purely financial basis I find any of these hosted solutions really hard to justify against GPT-3.5 Turbo prices, including building your own rig. $5k + electricity is loads of 3.5 Turbo tokens.
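For scale, assuming GPT-3.5 Turbo's list prices at the time were roughly $0.50/M input and $1.50/M output (treat those figures as an assumption):

```
$5,000 ÷ $0.50 per 1M input tokens  ≈ 10B input tokens
$5,000 ÷ $1.50 per 1M output tokens ≈ 3.3B output tokens
```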
Of course none of the data scientists or researchers I work with want to use that though - it's not their job to host these things or worry about the costs.
So my main motivation is not so much to have the lowest cost, but to have the most predictable cost.
Knowing up front this is my fixed ML budget gives me peace of mind and gives me room to try stupid ideas without worrying about it.
Whereas doing it in the cloud, you can a) get slammed with some crazy bill by accident, b) have to think more about what resources testing an idea will take, or conversely c) get GPU FOMO and think "if I just upgrade a level, all my problems will be solved".
It works for me; everybody's mileage varies, but personally I like to budget, spend, and then totally focus on my goals and not my cloud spend.
I’m also from the pre-cloud era, so doing stuff on my own bare metal is second nature.
I’m doing non-interactive tasks, but with the A6000 running llama3 70b in chat mode, it’s as usable as any of the commercial offerings in terms of speed. I read quickly, and it’s faster than I read.
Yea, for any hobbyist, indie developer, etc. I think it'd be ridiculous to not first try running one of these smaller (but decently powerful) open source models on your own hardware at home.
Ollama makes it dead simple just to try it out. I was pleasantly surprised by the tokens/sec I could get with Llama 3 8B on a 2021 M1 MBP. Now I need to try it on my gaming PC that I never use. It would be super cool to just have an LLM server on my local network for me and the fam. Exciting times.
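If anyone wants the quickstart, it's roughly this (the curl one-liner is the Linux installer; on macOS it's the app download, and `OLLAMA_HOST` is how you expose it to the LAN):

```bash
curl -fsSL https://ollama.com/install.sh | sh   # install (Linux)
ollama run llama3:8b                            # pull + chat with Llama 3 8B
OLLAMA_HOST=0.0.0.0 ollama serve                # bind the API to all interfaces for the LAN
```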
I assume the characteristics would be pretty different, since your local hardware can keep the context loaded in memory, unlike APIs which I'm guessing have to re-load it for each query/generation?
Yeah, but this article is terrible. First it talks about naively copy-pasting code to get "a seeming 10x speed-up", and then: "This ended up being incorrect way of calculating the tokens used."
I would not bank on anything in this article. It might as well have been written by a tiny Llama model.
This is great advice. I used to run my dev stuff on AWS, then built a small 6-server Proxmox cluster in my basement: 300 cores, 1TB memory, 12TB SSD storage, for about $3k USD. I don't even want to know what it would cost to run a similar config on AWS. You can get cheap DDR4 servers on eBay all day.
Furthermore, the AWS estimates are really poorly done. Using EKS this way is really inefficient; a better comparison would be Haiku on AWS Bedrock, which averages $0.75/M tokens: https://aws.amazon.com/bedrock/pricing/
This whole post makes OpenAI look like a better deal than it actually is.
I was getting that sense too. It would not be difficult to build a desktop machine with a 4090 for around $2500. I run Llama-3 8b on my 4090, and it runs well. Plus side is I can play games with the machine too :)
Don't listen to this person. They have no idea what they're talking about.
No one cares about this TOS provision. I know both startups and large businesses that violate it as well as industry datacenters and academic clusters. There are companies that explicitly sell you hardware to violate it. Heck, Nvidia will even give you a discount when you buy the hardware to violate it in large enough volume!
In a previous AI wave, hosters like OVH and Hetzner started offering servers with GTX 1080s at prices that other hosters with datacenter-grade GPUs couldn't possibly compete with, and VRAM wasn't as big of a deal back then. That's who this clause targets.
If you don't rent out servers or VMs, Nvidia doesn't care. They aren't Oracle.
There's no further elaboration on what "datacenter" means here, and it's a fair argument to say that a closet with one consumer-GPU-enriched PC is not a "datacenter deployment". The odds that Nvidia would pursue a claim against an individual or small business who used it that way is infinitesimal.
So both the ethical issue (it's a fair-if-debatable reading of the clause) and the practical legal issue (Nvidia wouldn't bother to argue either way) seem to say one needn't worry about it.
The clause is there to deter at-scale commercial service providers from buying up the consumer card market.