Instead of using AWS, another approach is to self-host the hardware as well. Even after factoring in energy, this dramatically lowers the price.
Assuming we want to mirror our setup in AWS, we'd need 4x Nvidia Tesla T4s. You can buy them for about $700 each on eBay.
Add in $1,000 to set up the rest of the rig and you have a final price of around:
(4 × $700) + $1,000 = $2,800 + $1,000 = $3,800
This whole exercise assumes that you're using the Llama 3 8b model. At full fp16 precision that will fit in a single 3090 or 4090 GPU (the int8 version will too, and it runs faster with very little degradation). Especially if you're willing to buy GPU hardware on eBay, that will cost significantly less than the T4 rig.
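The sizing intuition, roughly (weights only, ignoring KV cache and activation overhead):

```
8B params × 2 bytes (fp16) ≈ 16GB → fits a 24GB 3090/4090
8B params × 1 byte  (int8) ≈  8GB → plenty of headroom
```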
I have my home workstation with a 4090 exposed as a vLLM service to an AWS environment, where I access it via a reverse SSH tunnel.
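For anyone curious, a minimal sketch of that kind of tunnel, assuming vLLM's OpenAI-compatible server on port 8000 and a placeholder AWS hostname:

```bash
# On the home workstation: start the vLLM server (model and port are examples).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000 &

# Reverse tunnel: -R binds port 8000 on the AWS box and forwards it back here;
# -N opens no remote shell, just the tunnel.
ssh -N -R 8000:localhost:8000 ubuntu@aws-host.example.com

# On the AWS side, the GPU now looks like a local service:
curl http://localhost:8000/v1/models
```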
Why did this only occur to me recently? You can self-host a k8s cluster and expose the services using a $5 DigitalOcean droplet. The droplet and the k8s services are point-to-point connected using Tailscale. Performance is perfectly fine, it keeps your skillset sharp, and you're self-hosting!
You can also just directly connect to containers using Tailscale if it's just for internal use. That is, having an internally addressable `https://container_name` on your tailnet per container if you want. This way I can set up Immich, for example, and it's just on my tailnet at `https://immich` without the need for a reverse proxy, etc.
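One way to get that per-container hostname is a Tailscale sidecar; a rough sketch, where the auth key, volume, and image tags are placeholders:

```bash
# Tailscale sidecar joins the tailnet as the node "immich".
docker run -d --name immich-ts \
  -e TS_AUTHKEY=tskey-auth-XXXX \
  -e TS_HOSTNAME=immich \
  -e TS_STATE_DIR=/var/lib/tailscale \
  -v tailscale-state:/var/lib/tailscale \
  tailscale/tailscale

# The app shares the sidecar's network namespace, so it answers on that node.
docker run -d --name immich --network container:immich-ts \
  ghcr.io/immich-app/immich-server
# With MagicDNS (and HTTPS certs) enabled, https://immich resolves on the tailnet.
```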
Whoa, so you have code running in AWS making use of your local hardware via what is called a reverse SSH tunnel? I will have to look into how that works; that's pretty powerful if so. I have a mac mini that I use for builds and deploys via FTP/SFTP, and I was going to look into setting up "messaging" via that pipeline to access local hardware compute through file messages lol. But a reverse SSH tunnel sounds like it'll be way better for directly calling executables, rather than needing to parse messages from files first.
Look into Nebula (or Tailscale if you trust third parties). I have all my workstations and servers on a mesh network that appears as a single /24, is end-to-end encrypted and mutually authenticated, and works through/behind NAT. I can spawn a vhost on any server that reverse-proxies an API to any port on any machine.
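A sketch of what that vhost trick can look like, assuming nginx and an example overlay IP from a 192.168.100.0/24 mesh subnet (the hostname, cert paths, and addresses are placeholders):

```bash
cat > /etc/nginx/conf.d/api.example.com.conf <<'EOF'
server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate     /etc/nginx/certs/api.crt;
    ssl_certificate_key /etc/nginx/certs/api.key;
    location / {
        # Proxy to a port on another machine, carried over the encrypted mesh.
        proxy_pass http://192.168.100.12:8080;
    }
}
EOF
nginx -s reload
```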
Well, one example, depending on your threat model: their privacy policy states that they retain info and comply with subpoenas.
There's also potential for malicious updates to compromise a network (as there is with most software unless you're auditing the source for each update).
E2EE is only as meaningful as where the keys reside and how easily those keys can be abused.
The idea of “user’s permission” is determined by Tailscale and/or the OIDC provider. I don’t know anything about “tailnet lock”; perhaps it is a new mitigation for this issue?
I didn’t start with Tailscale because the only way you could log into it was with Google or GitHub or something. I don’t trust Microsoft or Google with auth for my internal network. I thought about running Headscale, but Nebula was faster/easier for me.
The docs are good. When creating the initial CA, make absolutely sure you set the CA expiration to 10-30 years; the default is 1 year, which means your whole setup explodes in a year without warning.
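Concretely, something like this when minting the CA (the name and IP are placeholders; check `nebula-cert ca -h` for the exact flags in your version):

```bash
# 175200h ≈ 20 years; without -duration the CA expires in 1 year.
nebula-cert ca -name "my-mesh" -duration 175200h

# Host certs are then signed against that CA, e.g.:
nebula-cert sign -name "workstation" -ip "192.168.100.12/24"
```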
I feel like WireGuard definitely does NAT punching, unless I misunderstand you. I've been doing this sort of thing to have my phone and desktop on the same "LAN" all the time so I can Moonlight in from anywhere (among other things), and they're definitely NATted.
Unfortunately my mac mini isn't beefy enough to run Ollama; it's the base-model M1 from a couple years ago lol. But it's very capable for builds, deploys, and some computation via scripts. Now I'm curious to check how much memory the newest ones support, for potentially running Ollama on one haha. Thanks!
(It may need to be Q4 or Q3 instead of Q5 depending on how the RAM shakes out. But the Q5_K_M quantization (k-quantization is the term) is generally the best balance of size vs. performance vs. intelligence if you can run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of course even better, but you're nowhere near fitting that in 8GB.)
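For reference, picking a quantization in Ollama is just a matter of the model tag; a sketch (the exact tag names may differ, check the Ollama library):

```bash
ollama pull llama3:8b-instruct-q5_K_M   # best balance if it fits in RAM
ollama pull llama3:8b-instruct-q4_K_M   # fallback when memory is tight
ollama run llama3:8b-instruct-q4_K_M "Explain k-quantization in one sentence."
```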
Dolphin-llama3 is generally more compliant, and I'd recommend it over the base model. It's been fine-tuned to drop the dumb "sorry, I can't do that" battle, and it turns out this also increases the quality of the results (refusal training limits the space you're generating in, and limiting that space also limits the quality of the results).
Most of the time you will want to look for an "instruct" model; if it doesn't have the instruct suffix it'll normally be a "fill in the blank" model that continues what it thinks is the pattern in the input, rather than generating a textual answer to a question. But Ollama typically pulls the instruct models into its repos.
(Sometimes you will see this even with instruct models, especially if they're misconfigured. When the non-dolphin llama3 first came out, I played with it and I'd get answers that looked like Stack Overflow- or Quora-format responses, complete with "scores" etc., either as the full output or mixed in. Presumably a misconfigured model, or they pulled in a non-instruct model, or something.)
Dolphin-mixtral:8x7b-v2.7 is where things get really interesting imo. I have 64GB and 32GB machines, and so far the Q6 and Q4_K_M quants are the best options for those machines respectively. dolphin-llama3 is reasonable, but dolphin-mixtral gives richer, better responses.
I'm told there's better stuff available now, but I'm not sure what a good choice would be for 64GB and 32GB if not Mixtral.
Also, just keep an eye on r/LocalLLaMA in general; that's where all the enthusiasts hang out.
Using Tailscale can make the networking setup much easier; I really like their service for things like this (or for curling another dev's locally running server).
I don't know enough about networking and that level of configuration to do it confidently and safely yet. I'd rather rely on credentials where the only real access is a few limited command-line executables or file transfer, rather than exposing more of the hardware directly to the network for a direct connection. I do have an interest in learning this much, but I find that my current FTP/SFTP approach has more guardrails than a direct connection. Do you agree with this, or am I just not understanding enough about IPv6 and direct connections home?
You get the equivalent setup in IPv6 by having your home modem or router deny inbound connections.
I think port forwarding configuration is a pain that does not offer value over just poking a hole in your firewall for an authenticated connection over SSH.
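In firewall terms that's just default-deny inbound plus one hole; a minimal sketch with ip6tables (illustrative, not a hardened config):

```bash
ip6tables -P INPUT DROP                                                  # default-deny inbound
ip6tables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT  # allow return traffic
ip6tables -A INPUT -p tcp --dport 22 -j ACCEPT                           # the SSH hole (key auth only)
```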
Read the other comment and also immediately thought of Cloudflare Tunnel instead. Is there a reason you chose that? Wondering if I should do the same with my old Titan XP (probably slower than your 3070, but it does have 12GB of VRAM).
No reason other than that Cloudflare is quite a nice reverse proxy and I already use it for managing my DNS records. Thus far I've not noticed any issues with it, other than when my home wifi goes down.
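For anyone wanting to try the Cloudflare Tunnel route, it's roughly this (the tunnel name and hostname are placeholders; a config file is the more permanent setup):

```bash
cloudflared tunnel login                                  # authorize against your CF account
cloudflared tunnel create llm                             # create a named tunnel
cloudflared tunnel route dns llm llm.example.com          # point a hostname at it
cloudflared tunnel run --url http://localhost:8000 llm    # front the local service
```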
1B tokens on Gemini Flash (which in my experience is on par with llama3-70b, or sometimes even better) at a 2:1 input:output split would cost ~$600 (ignoring the fact that they offer 1M tokens a day for free now). Ignoring electricity, you'd break even in >8 years. You can find llama3-70b hosted at ~the same prices if you're interested in that specific model.
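Back-of-envelope, assuming the ~$5k rig cost cited elsewhere in the thread and ~1B tokens of usage per year (both assumptions, not the parent's exact inputs):

```
$5,000 ÷ ($600 per 1B tokens) ≈ 8.3B tokens
at ~1B tokens/year, that's >8 years to break even on the hardware
```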
I answered the financial side in another reply, but another factor is that I need to know the model today is exactly the same as the model tomorrow for reliable scientific benchmarking.
I need to tell if a change I made was impactful, but if the model just magically gets smarter or dumber at my tasks with no warning, then I can't tell if I made an improvement or a regression.
Whereas the model on my GPU doesn't change unless I change it. So it's one less variable, and LLMs are black boxes to start with.
I may be wrong about Gemini, but my impression is that all the companies are constantly tweaking the big models. I know GPT on Monday is not always the same GPT on Thursday, for example.
I've worked professionally over the last 12 months hosting quite a few foundation models and fine-tuned LLMs on our own hardware, AWS + Azure VMs, and also a variety of the newer "inference serving" type services that are popping up everywhere.
I don't do any work with the output; I'm just the MLOps guy (ahem, DevOps).
You mention expense, but on a purely financial basis I find any of these hosted solutions really hard to justify against GPT-3.5 Turbo prices, including building your own rig. $5k + electricity is loads of 3.5 Turbo tokens.
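For scale, assuming GPT-3.5 Turbo's list prices at the time were roughly $0.50/M input and $1.50/M output (treat those figures as an assumption):

```
$5,000 ÷ $0.50 per 1M input tokens  ≈ 10B input tokens
$5,000 ÷ $1.50 per 1M output tokens ≈ 3.3B output tokens
```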
Of course none of the data scientists or researchers I work with want to use that though - it's not their job to host these things or worry about the costs.
So my main motivation is not so much to have the lowest cost, but to have the most predictable cost.
Knowing up front this is my fixed ML budget gives me peace of mind and gives me room to try stupid ideas without worrying about it.
Whereas doing it in the cloud, you can a) get slammed with some crazy bill by accident, b) have to think more about what resources testing an idea will take, or conversely c) get GPU FOMO and think "if I just upgrade a level, all my problems will be solved".
It works for me; everybody's mileage varies, but personally I like to budget, spend, and then totally focus on my goals and not my cloud spend.
I’m also from the pre-cloud era, so doing stuff on my own bare metal is second nature.
I’m doing non-interactive tasks, but with the A6000 running llama3 70b in chat mode, it’s as usable as any of the commercial offerings in terms of speed. I read quickly, and it’s faster than I read.
Yea, for any hobbyist, indie developer, etc. I think it'd be ridiculous to not first try running one of these smaller (but decently powerful) open source models on your own hardware at home.
Ollama makes it dead simple just to try it out. I was pleasantly surprised by the tokens/sec I could get with Llama 3 8B on a 2021 M1 MBP. Now I need to try it on my gaming PC that I never use. It would be super cool to just have an LLM server on my local network for me and the fam. Exciting times.
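If anyone wants the quickstart, it's roughly this (the curl one-liner is the Linux installer; on macOS it's the app download, and `OLLAMA_HOST` is how you expose it to the LAN):

```bash
curl -fsSL https://ollama.com/install.sh | sh   # install (Linux)
ollama run llama3:8b                            # pull + chat with Llama 3 8B
OLLAMA_HOST=0.0.0.0 ollama serve                # bind the API to all interfaces for the LAN
```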
I assume the characteristics would be pretty different, since your local hardware can keep the context loaded in memory, unlike APIs which I'm guessing have to re-load it for each query/generation?
Yeah, but this article is terrible. First it talks about naively copy-pasting code to get "a seeming 10x speed-up", and then: "This ended up being incorrect way of calculating the tokens used."
I would not bank on anything in this article. It might as well have been written by a tiny Llama model.
This is great advice. I used to run my dev stuff on AWS, then built a small 6-server Proxmox cluster in my basement: 300 cores, 1TB memory, 12TB SSD storage, for about $3k USD. I don't even want to know what it would cost to run a similar config on AWS. You can get cheap DDR4 servers on eBay all day.
Furthermore, the AWS estimates are really poorly done. Using EKS this way is really inefficient; a better comparison would be Haiku on AWS Bedrock, which averages $0.75/M tokens: https://aws.amazon.com/bedrock/pricing/
This whole post makes OpenAI look like a better deal than it actually is.
I was getting that sense too. It would not be difficult to build a desktop machine with a 4090 for around $2500. I run Llama-3 8b on my 4090, and it runs well. Plus side is I can play games with the machine too :)
Don't listen to this person. They have no idea what they're talking about.
No one cares about this TOS provision. I know both startups and large businesses that violate it as well as industry datacenters and academic clusters. There are companies that explicitly sell you hardware to violate it. Heck, Nvidia will even give you a discount when you buy the hardware to violate it in large enough volume!
In a previous AI wave, hosters like OVH and Hetzner started offering servers with GTX 1080s at prices that other hosters with datacenter-grade GPUs couldn't possibly compete with, and VRAM wasn't as big of a deal back then. That's who this clause targets.
If you don't rent out servers or VMs, Nvidia doesn't care. They aren't Oracle.
There's no further elaboration on what "datacenter" means here, and it's a fair argument to say that a closet with one consumer-GPU-enriched PC is not a "datacenter deployment". The odds that Nvidia would pursue a claim against an individual or small business who used it that way is infinitesimal.
So both the ethical issue (it's a fair-if-debatable reading of the clause) and the practical legal issue (Nvidia wouldn't bother to argue either way) seem to say one needn't worry about it.
The clause is there to deter at-scale commercial service providers from buying up the consumer card market.