
not picking on you, specifically, but i wonder how many people could roughly draw a cardboard box template correctly. It is an easier object than a bicycle, which people routinely have issues drawing, too.

generally, there aren't "missing pieces" in a cardboard box.


Fair, that template resolves to a box, but it's missing stuff like tabs to make the bottom stick together properly; and it's probably not optimal in its use of cardboard. Also, it was designed in a minute in draw.io to make a stranger on the internet chuckle, so lots of constraints to fulfill.

considering that ripgrep has marginal overhead over just reading the files to /dev/null, how exactly does this achieve a 100x speedup?

I have a lot of use for something that can search ~1GB of text "instantly", but so far nothing beats rg/ag after the data has been moved into RAM.
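To reproduce the "in RAM" condition: read the corpus once to warm the page cache, then time the search. A rough sketch - the paths here are illustrative:

    # first pass pulls everything into the page cache
    cat corpus/*.txt > /dev/null
    # repeat runs of the search are now memory-bound, not disk-bound
    time rg -c 'needle' corpus/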


The trick to optimization is not "doing faster" but "doing less". I already feel rg is missing a ton of results I want to see because it has a very large ignore list by default.

alias rg="rg -iuu"
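(-i makes matches case-insensitive; each -u peels off a layer of ripgrep's smart filtering - one -u stops honoring .gitignore, and -uu additionally searches hidden files.)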

The crate says it uses SIMD, but the crate also says that content search is 20-50 times faster. Maybe the author is unsure how fast it is, or how much speedup he should claim to get recognition.

it very much depends on the platform and the operating system

for example, ripgrep doesn't do any memory mapping on macos, which makes it 2-3x faster just because of that


you can try it yourself: a ripgrep search for "MAX_FILE_SIZE" in the chromium repo takes 6-7 seconds, with fff it is 20 milliseconds

so essentially, in this specific case, it is over 300x faster, but the repo is huge (66G, 500k files)
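a hedged way to reproduce this kind of comparison is hyperfine; note that the fff invocation below is an assumption, check its --help:

    # run from the repo root; --warmup discards the first, cold-cache run
    hyperfine --warmup 1 \
      'rg MAX_FILE_SIZE' \
      'fff MAX_FILE_SIZE'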


i had this issue, with an even wilder set of restrictions, so i used Caddy to "output its own access log", and i had a cron job on each server at home that would hit that caddy server with a pre-defined key - so like `http://caddyserver.example.com/q?iamwebserver2j` for one server and `q?iamVOIP` for another.

https://github.com/genewitch/opensource/blob/master/caddy_ge...

https://github.com/genewitch/opensource/blob/master/show_own...

And now i have bi-directional IP exposure. it's cute because if you just drive by, you can't tell - it doesn't look like it does anything. you have to refresh to see your IP, which is a little obfuscation.

if you care about security, i'm not sure what to tell you; use port knocking.

Please note: this doesn't require installing anything on any remote, just a cron job to curl a specific (arbitrary) URL. I used it to find the IP to ssh into remote radio servers (like allstar, d-star) for maintenance, for example.
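For the curious, the whole scheme boils down to the below. The hostname and key are the example ones from above; the schedule and log path are illustrative:

    # crontab entry on each remote box: phone home every 10 minutes so
    # its current public IP lands in the Caddy access log
    */10 * * * * curl -s "http://caddyserver.example.com/q?iamwebserver2j" >/dev/null

    # on the Caddy box: recover that server's latest IP from the log
    grep 'iamwebserver2j' /var/log/caddy/access.log | tail -1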


there's iconography of a partially eaten fruit on the cases, and some of them glow.

eta: i'm just saying, if i had a glowing half-drunk beer or partially eaten pizza on my laptop in a business meeting, i'd get weird looks. Just because you all normalized glowing fruit doesn't mean the rest of us take you seriously.


there are hundreds of good books on all types of addiction, including home-shopping-network style, gambling / lootbox / gacha, adrenaline, sex, and so on. My spouse, at the beginning of this month, went to a 2-day series of lectures about novel treatments for gambling addiction, as part of the CEUs for their license. I know most of HN won't know what i am talking about, so:

In general, professionals must be licensed and bonded. The state requires a degree and a test for the first license; then, for my spouse's, something like 8000 additional hours of training and something like 100 hours of continuing education per year. a CEU is 1 hour of continuing education. you have ~5 years, as a rolling window, to transition your license by doing the above training and CEUs. Doctors, nurses, etc. all have to do this sort of thing.

Would any of you put up with that kind of stuff to make $80k a year?


We're a two-parent household, and my spouse had cancer and never really got all of their energy back, and works full time, so the entirety of home, land, and car maintenance comes to me.

I homeschool our youngest because the school system here sucks, based on the experiences of our older two. I'm always exhausted. I solved this (the "parents must be more involved" part) by watching my kid play roblox and arguing with them about spending their money on gift cards instead of lego, posters, or whatever else that isn't so fleeting; i also don't let them have a cellphone. They turn 10 in June.

We don't have TV or CATV; i have downloaded most of the old TV programs that kids liked, and grandma doesn't watch kids' shows, so they really don't have a perspective on what everyone else's viewing habits are. They watch YT on their Switch about fireworks and cars, plus some of the idiots with too much money acting goofy, plus what i would call "vines compilations" of just noises and moving pictures; i don't get it, but it seems harmless. For the record, pihole no longer blocks youtube ads, so i was just told there are now ads on the Switch, too.

But anything beyond that - i can't watch, nor do i want to watch, their every interaction on a computer. I've gotta cook, and the weather isn't always conducive to sending them outside to play, either. When i was growing up and got bored, there wasn't much i could do about it. Today, my youngest has virtually anything on the planet just peeking around the corner. America's Funniest Home Videos and a blue square shooting red squares at orange squares? yeah, ok.

===========

It's getting to the point where i think people who have really strong opinions on topics like this need to disclose any positions they might have that influence their opinion. My disclosure is that i have no positions in any company or entity.

Everyone in the US has been fed a lie that if we just work hard and don't interfere with the billionaire class, that someday, we, too, can be rich like them. It's a bum steer, folks. For each 1 billionaire that "came up from the slums" or whatever, there are 100 that are billionaires because their families did some messed-up stuff, probably globally, sometime in the last 200 years. And offhand, knowing the stories of a bunch of billionaires, even 10 in the US who were honestly self-made - who didn't commit fraud, cheat, or skirt regulations to get there - seems almost a magnitude too high.

i bring the above two paragraphs up because if one has a position in facebook, of course they're going to rail against facebook losing 230 protection for any part of their operation: instagram, the FB feed, whatever. The same goes for a person with a position in GOOG, or Apple, or Tesla. What's that Upton Sinclair quote that's been mentioned twice? If someone believes that, given luck and grit, they too could make a "facebook"-sized corp, but not if the government says "you can't addict children to sell ads", then i consider them a creep.

for the record: my oldest two are in their early 20s now.


My friend and I are usually pretty good at ballparking things of this nature, that is, "approximately how much textual data is github storing?" I immediately put an upper bound of a petabyte; there's absolutely no way that github has a petabyte of text.

Assuming just text, deduplication, and not being dumb about storage patterns, our range is 40-100TB, and that's probably too high by 10x. 100TB would mean that the average repo is 100KB, too.

Nearly every arcade machine and pre-2002 console is available as a software "spin" that's <20TB.

How big was "every song on spotify"? 400TB?

The Eye is somewhere between a quarter and half a petabyte.

Wikipedia is ~100GB. It may be more now; i haven't checked. But the raw DB with everything you need to display the text contained in wikipedia is 50-100GB, and most of that is the markup - that is, not information for us, but information for the computer.

Common Crawl, with over 1.97 billion web pages in their archive: 345TB.

We do not believe this has anything to do with the "queries per second" or "writes per second" on the platform. Ballpark, github probably smooths out to around ten thousand queries per second, median. I'd have guessed less, but then again, 15 years ago i worked on a photography website database that was handling 4000QPS all day long between two servers.

P.S. just for fun i searched github for `#!/bin/bash` and it returned 15.3 million "code" results. Assume you replace just that 12-byte line with 2 bytes: you save ~150MB on disk. That's compression; but how many files are duplicated? I don't mean forks with no action, but different projects. Also, i don't care to discern the median bash-script byte-length on github, but ballparked to a mean of 1000 chars/bytes, that's ~15GB on disk for just bash scripts :-)
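the napkin math, if anyone wants to redo it (a sketch; the 15.3 million figure is the match count from the search above):

    # replacing the 12-byte "#!/bin/bash\n" line with a 2-byte token
    # saves 10 bytes per file:
    echo $(( 15300000 * (12 - 2) ))   # 153000000 bytes, ~150MB
    # at a ballparked mean of 1000 bytes per script:
    echo $(( 15300000 * 1000 ))       # ~15.3GB of bash scripts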

i have ~593 .sh files that everything.exe can see: 322 are 1KB or less, 100 are 1-2KB, 133 are 2-10KB, and the rest - 38 - are >10KB. of the 1KB-or-less ones, a random sample shows they cluster such that the mean is ~500B.
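if you want the same histogram on a unix box (everything.exe is windows-only), a sketch using GNU find; the search root and buckets here are illustrative:

    find "$HOME" -name '*.sh' -printf '%s\n' 2>/dev/null |
      awk '{ if ($1 <= 1024) a++; else if ($1 <= 2048) b++;
             else if ($1 <= 10240) c++; else d++ }
           END { printf "<=1KB: %d  1-2KB: %d  2-10KB: %d  >10KB: %d\n", a, b, c, d }'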


Veracity unconfirmed, but this article asserts that until they did some cleanup they were storing 19 petabytes.

https://newsletter.betterstack.com/p/how-github-reduced-repo...

maybe sourced from this tweet?

https://x.com/github/status/1569852682239623173

Edit: though maybe that data doesn't count as your "just text" data.


yeah, i assume all the artifacts[0] and binaries greatly inflate that. I have no idea how git works under the hood as implemented at github, so i can't comment on potential reasons there.

Is there some command a git administrator can issue to see granular statistics, or is "du -sh" the best we can get?

0: i'm assuming a site-rip that only fetches the files equivalent to clicking the "zip download" button - not the releases, wikis, images, workers, gists, etc.
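For a single repo, git itself gives more granular numbers than `du -sh`, and GitHub's own git-sizer goes further. A sketch, run inside any clone:

    # loose vs. packed object counts and their on-disk sizes
    git count-objects -vH
    # git-sizer (a separate install) breaks the repo down by blobs,
    # trees, and commits, and flags outliers
    git-sizer --verbose

whether github exposes anything like this fleet-wide, i don't know.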


I don't think the issue at hand is a technical challenge. It's merely a sign, imo, that usage has surged due to AI. To your point, this is a solvable scaling problem.

My worry is for the business and how they structure pricing. GitHub is able to provide the free services they do because at some point they did the math on what a typical free-tier user does before growing into a paid one. They even did the math on what paid users do, so they know they'll still make money when charging whatever amount.

My hunch is AI is a multiplier on usage numbers, which increases OpEx, which means it's eating into GH's assumptions on margin. They will either need to accept a smaller margin, find other ways to shrink OpEx, or restructure their SKUs. The Spotifies and YouTubes of the world hosting other media formats have it harder than them, but they are able to offset the cost of operation by running ads. Can you imagine having to watch a 20 second ad before you can push?


> Common Crawl, with over one billion, nine hundred and seventy thousand web pages in their archive: 345TB.

Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is from one of our 122 crawls.


oh, i didn't see that the 1.97 billion pages were crawled in an 11-day period earlier this month. either way, nearly 2,000,000,000 pages fit in about a third of a petabyte...

p.s. thanks for correcting me, i was using this information for something else, and now it's correct!


i once fixed a site that was going down several times a year with two t1.micro instances in the same region as the majority of its traffic. Instantly solved the problem for, what, $20/month?

Another site was constantly getting DDoSed by Russians who were mad we took down their scams on forums; that had to go through verisign back then, not sure who they're using now. They may have enough aggregate pipe that it doesn't matter at this point.


i haven't written a real song in a decade, and the previous decade only had a half dozen. 3 decades ago we (the band) and i wrote about a dozen albums.

Can you guess how old all my kids are? Write it on a slip of paper and put it in your hat for later.


Judge the judging judgers

