I used DALL-E and Anki to teach my son to read. I'm not sure the HN crowd will be wowed by old tech, but this post leaves the tech to the agents and deals with the emotional lessons I learned along the way.
Yes, I wrote the clickbaity title with AI, but the rest of the post by hand.
That's a spot-on parallel! Python circular imports (especially for type hinting) are basically the software equivalent of this infrastructure deadlock.
Do you use string-based forward references ("ClassName") to break the cycles? That's essentially our "empty shell" trick — decoupling the resource identity from its configuration to satisfy the graph.
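For anyone following along, the pattern I mean looks roughly like this (module and class names are hypothetical, just to show the shape):

    # models/network.py
    from __future__ import annotations  # or quote the annotations manually
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # only seen by the type checker, so no runtime import cycle
        from models.security_group import SecurityGroup

    class Network:
        def attach(self, sg: "SecurityGroup") -> None:
            # the string annotation is resolved lazily, so
            # models.security_group is free to import Network
            ...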
Did you stick with Tarjan's for the SCC detection on the module graph?
I haven’t had major issues with SCCs yet.
The linter enforces forward references, so the cycle pain we do have is with dynamic/deferred imports, and it’s usually solved by splitting a module.
If you look at the Pyrefly repo (Meta’s new type checker), there are some deep thoughts about SCCs, but I didn’t fully grok them.
Thanks for the Pyrefly pointer — I hadn't tracked Meta's Rust rewrite yet. Will dig into their SCC handling.
Your "splitting a module" framing is exactly right. In the IaC world, a Security Group with inline rules is like a Python module with circular imports — it couples identity with logic. The fix is the same: extract the logic into separate resources (or modules), keep the original as a pure identity/interface.
Interesting that the same pattern shows up in both compiler design and infrastructure tooling.
I’m wrapping up a role where I spent a significant amount of time writing Triton kernels. It’s a fantastic tool, but the learning curve has some sharp edges. I wanted to share a few practical "notes from the field" for anyone moving beyond the very opaque docs.
Just yesterday I was reading through this five-year-old post on Triton by its creator. Triton was their PhD thesis, and they coined the name before the inference server was renamed to it… if that's what you're referring to?
I use DuckDB as a data scientist/analyst. It’s amazing for working with large data locally, because it is very fast and there is almost zero overhead to start using it.
For example, I helped an Israeli NGO analyze retailer pricing data (supermarkets must publish prices every day by law). Pandas chokes on data that large; Postgres can handle it, but aggregations are very slow. DuckDB is lightning fast.
The traditional alternative I’m familiar with is Spark, but it’s such a hassle to set up, expensive to run, and not as fast on these kinds of use cases.
I will note that familiarity with Parquet and how columnar engines work is helpful. I have gotten tremendous performance increases by storing the data in a sorted manner in a Parquet file, which is ETL overhead (a rough sketch of that step is below).
Still, it’s a very powerful and convenient tool for working with large datasets locally.
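To make the Parquet point concrete, here is roughly what that ETL step can look like using DuckDB itself (file names and sort columns are made up for illustration):

    import duckdb

    # One-time ETL: read the raw CSVs, sort by the columns most queries
    # filter/group on, and write a Parquet file. Sorting improves row-group
    # pruning when the file is queried later.
    duckdb.sql("""
        COPY (
            SELECT * FROM read_csv_auto('prices_raw/*.csv')
            ORDER BY chain_id, store_id, date
        ) TO 'prices.parquet' (FORMAT parquet)
    """)

    # Later, aggregations against the sorted file are fast:
    duckdb.sql("""
        SELECT chain_id, avg(price)
        FROM 'prices.parquet'
        WHERE date >= '2024-01-01'
        GROUP BY chain_id
    """).show()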
So I'm not super familiar with different databases, but I do understand the basics, know how to work with data in e.g. pandas, and think I understand what DuckDB is useful for. What I'm still completely missing is: how do I get data into DuckDB? I.e. how did you get that data into DuckDB? Or: suppose I have a device producing sensor data; normally I'd connect to some MySQL endpoint and tell it to insert data. How does one do that with DuckDB? Or is the idea rather that you construct your DuckDB database first by getting data from somewhere else (like the MySQL DB in my example)?
My experience has been that most of the time you don’t tell DuckDB to insert data. One is expected to point DuckDB to an existing data file (parquet, csv, json with this new release, etc.) and either import/copy the data into DuckDB tables, or issue SQL queries directly against the file.
Think of it as a SQL engine for ad-hoc querying larger-than-memory datasets.
You can do it both ways, but the latter is the more useful one. DuckDB is designed to read data very fast and to operate on it fast. So you load a CSV/JSON/Parquet file, run "CREATE TABLE", and DuckDB lays out the data in a way that makes it fast to read.
But you (I) wouldn’t use it like a standard DB where stuff gets constantly written in; rather as a tool to effectively analyze data that’s already somewhere.
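Concretely, both paths look roughly like this in the Python API (file and table names are just for illustration):

    import duckdb

    con = duckdb.connect("analysis.db")  # or duckdb.connect() for in-memory

    # Option 1: query the file in place, no import step at all
    con.sql("SELECT count(*) FROM 'sensor_readings.parquet'").show()

    # Option 2: materialize it as a DuckDB table first
    con.sql("""
        CREATE TABLE readings AS
        SELECT * FROM read_csv_auto('sensor_readings.csv')
    """)
    con.sql("SELECT sensor_id, avg(value) FROM readings GROUP BY sensor_id").show()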
Postgres implements this[0] as well, and it's really wonderful.
It doesn't give a human the search experience they are used to, but for the superhuman who can write regex, this becomes a very cheap way to search data at scale.
I use trigram indices on a project I run[0] where I want to do cheap filtering of DB results and the performance is just outstanding; I didn't think free text search could be so fast!
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX IF NOT EXISTS lowercase_title ON streams (lower(title));
    CREATE INDEX IF NOT EXISTS title_trgm ON streams USING gin (lower(title) gin_trgm_ops);
And boom, super performant search via `LIKE %{}%`.
I also love taking advantage of `TABLESAMPLE system_rows()` which lets me do hyperfast random selection without needing to randomly sort the entire table. PG has so many hidden gems.
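For anyone wanting to try it, a hedged sketch of the query side (the streams/title names come from the snippet above; the id column, search term, and connection string are made up; system_rows needs its own CREATE EXTENSION tsm_system_rows):

    import psycopg2

    conn = psycopg2.connect("dbname=streams_db")  # hypothetical DSN
    cur = conn.cursor()

    # Trigram-indexed substring search; the gin index above is what makes this fast
    cur.execute(
        "SELECT id, title FROM streams WHERE lower(title) LIKE %s",
        ("%" + "boulder" + "%",),
    )
    print(cur.fetchall())

    # Fast random sample without sorting the entire table
    # (requires: CREATE EXTENSION tsm_system_rows;)
    cur.execute("SELECT * FROM streams TABLESAMPLE system_rows(100)")
    print(cur.fetchall())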
I'm a native Hebrew speaker (we have wild word morphology), and I also worked on NLP at a large bank where we trained models to analyze Bloomberg chats (which are almost English but behave like a DSL).
In both cases, "token-free" has been the way to go for extractive tasks like entity recognition.
It's unfortunately a technique that is off the beaten path, so a lot of extra engineering cost has to be paid. For example, when dealing with spans while embedding the text at the byte level, one needs to account for multi-byte codepoints that can throw off the alignment between "encoded" and raw byte offsets.
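A toy illustration of that alignment issue (pure Python, no model): character-based span offsets and byte-based offsets diverge as soon as a multi-byte codepoint appears.

    text = "שלום world"            # Hebrew letters are 2 bytes each in UTF-8
    encoded = text.encode("utf-8")

    # Character-level span start for "world"
    char_start = text.index("world")          # 5
    # Naively reusing char offsets on the byte sequence lands mid-codepoint
    print(encoded[char_start:char_start + 5])  # garbage bytes

    # Correct: map char offsets to byte offsets explicitly
    byte_start = len(text[:char_start].encode("utf-8"))  # 9
    print(encoded[byte_start:byte_start + len("world".encode("utf-8"))])  # b'world'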
As the article alludes, the increase in sequence length can be very expensive. In extractive tasks, I've found it effective to use much lighter models with limited (but large) context length like ByteNet[0].
Thinking out loud: as always, there is no one tool for everything. Often the summarization/few-shot capabilities of an off-the-shelf transformer are so far ahead that it's not worth building a model from (nearly) scratch just to solve the subtleties of tokenization. Other times, you don't need the unbounded context or have a simpler task, and can forgo the power of off-the-shelf models for the specificity of a token-free one.
I don't deny the incredible success of transformers, but at the end of the day they are not that amenable to interpretation.
Word2Vec was (despite its poor performance compared to transformers) much more amenable to mathematical interpretation (partition function, analogies, ...).
Rhetorical question (i.e. not directed at you): how about fixing word2vec instead of letting it rot in the history of machine learning books?
We are all familiar with word2vec's property that the vector sum ("king"-"man"+"woman") lies close to the vector of "queen". Imperfections for other words are due to things like word ambiguity (the vector of "ball" is a linear interpolation between the vectors "ball.1", "ball.2", ... if the corpus had each word annotated with the distinct word senses). Other reasons for imperfections could be limited corpus, stopping training "early" due to diminishing returns or for rational fear of overfitting, ...
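For anyone who wants to poke at this themselves, the classic demo with gensim's pretrained Google News word2vec vectors (a large ~1.7GB download; exact rankings vary by corpus):

    import gensim.downloader as api

    # pretrained word2vec vectors trained on Google News
    wv = api.load("word2vec-google-news-300")

    # "king" - "man" + "woman" ≈ ?
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # typically ranks "queen" at or near the top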
These analogy sets are directed shifts: for example "man"-"woman", "bull"-"cow", "husband"-"wife", ...
Suppose one were to ask an audience to randomly give examples of antonym pairs: people would submit things like:
loud, silent
dry, wet
love, hate
poor, rich
young, old
male, female
strong, weak
day, night
etc..., then someone submits:
bad, good
At this point the imaginary lecturer stops consulting the crowd and explains that in word2vec antonyms appear at a similar distance from each other, and that the axes going through each pair are roughly parallel.
The crowd slowly comes to the realization that this effectively sorts each pair into one of two global classes, say the class containing "good" and the class containing "bad":
good: silent, wet, hate, rich, old, female, weak, night
bad: loud, dry, love, poor, young, male, strong, day
Obviously something went horribly wrong. "This is unethical and must be stopped now!" a certain person exclaims.
The programmers blame the humans who expressed the corpus. Or perhaps the team that curated the corpus. Or perhaps they shift responsibility to the end-user for using the embeddings responsibly.
Someone mutters "algorithmic bias", whatever that means; the problem surely can't be due to the simple mathematical constraints we posed, or could it?
Proof by contradiction:
Suppose there was such a thing as an antonym shift vector v_antonym.
One expects that if such a shift exists, the antonym of the antonym of a word should be near the original word: "male" + 2 × v_antonym = "male".
Simple linear algebra in R^n (subtract "male" from both sides: 2 × v_antonym = 0, so v_antonym = 0) implies this cannot exist, unless each word had the same embedding location as its antonym!
If v_antonym = 0 then the equation can be satisfied trivially, but now the model cannot differentiate between concepts and their antonyms.
If v_antonym != 0 then the supposed antonym vector that might emerge from the algorithm is being forced into spontaneous symmetry breaking, so the slightest bias in the corpus will be amplified.
So yes, sometimes the oversimplification is due to the mathematician or programmer, ... and not due to the corpus, corpus curator or end-user!
=====
Now imagine generalizing word2vec so that instead of the embeddings lying in R^n they lie in T^n, an n-dimensional torus.
Consider now a shift vector with all coordinates zero except for one, which is halfway around the torus in that direction.
Applying such a shift twice results in a zero shift. Aha! A non-zero vector that, applied twice, returns to the original position!
How many such vectors are available? Naively n (one for each coordinate), but really all combinations are valid, so the number of antonym types on an n-dimensional torus is 2 to the power of n.
That should more than suffice to allow many types of undirected antonyms: if we negate an antonym shift vector on an n-dimensional torus, it is identical to the original shift, so antonym pairs can form without the algorithm forcing them into a global positive and negative class.
For example the shifts for (child, parent, child, ...) and (male, female, male, ...) would be different types of antonym shifts, and sometimes the combined shift is occupied by words: it satisfies both (father, daughter, father, ...) and (mother, son, mother, ...).
Other advantages of a toroidal representation: weekdays can be placed 1/7th of a full cycle apart along some direction (or indeed 0/7ths in others, or 2/7ths, etc.), months 1/12th apart, and so on.
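A tiny numpy sketch of the idea (coordinates in [0, 1) with wrap-around; all the numbers are made up): a half-turn shift is its own inverse, and a 1/7 shift cycles back after seven applications.

    import numpy as np

    def shift(x, v):
        # move on the n-torus: add, then wrap each coordinate back into [0, 1)
        return (x + v) % 1.0

    word = np.array([0.13, 0.40, 0.77])

    # antonym-type shift: half a turn in one coordinate
    v_antonym = np.array([0.0, 0.5, 0.0])
    once  = shift(word, v_antonym)
    twice = shift(once, v_antonym)
    print(np.allclose(twice, word))                     # True: twice is the identity
    print(np.allclose(v_antonym, (-v_antonym) % 1.0))   # True: the shift is its own negation

    # weekday-type shift: 1/7 of a turn, returns to start after 7 applications
    v_day = np.array([1/7, 0.0, 0.0])
    x = word
    for _ in range(7):
        x = shift(x, v_day)
    print(np.allclose(x, word))                          # True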
=======
Character-level tokens:
In word2vec it's unclear how to generalize from words to sentences: simply adding the word vectors of a sentence does not preserve meaning.
For example:
"the cat bit the dog"
means something very different from
"the dog bit the cat"
So concatenation of words can't be addition of vectors, since vector addition commutes: v + w = w + v.
So "the" + ("cat" + "bit") + "the dog" would equal
"the" + ("bit" + "cat") + "the dog".
If we can swap adjacent words (due to vector commutativity), we can repeat swapping and move "cat" to the place of "dog" and vice versa.
Similarly, word2vec is not used at the character level.
Observe that some words or concepts do in fact roughly commute in English:
"the green round block" means roughly the same as "the round green block"
So adjectives largely commute (although not entirely: "the large red block" sounds more natural to me than "the red large block").
So we sometimes want commutativity and sometimes don't, depending on the pair of words.
Next consider training some kind of generalized word2vec, let's call it character2element such that somehow the character elements combine mathematically to produce a new element which roughly corresponds to what we call word vectors in word2vec.
Some words are made of 1 character, others are made of many characters.
So the elements should have some type of "product" such that "multiplying" 2 elements results in an element of the same space.
This could be satisfied with matrices (char2matrix) or perhaps "geometric algebra" multivectors (char2gamv).
In this way we train the characters' (matrix or multivector) coefficients such that, for example, multiplying the characters of each synonym (assuming the words have exactly one sense) results in the same matrix or multivector. (I don't propose this as the training loss though!)
i.e. multiplying the characters q u i c k = f a s t (or roughly so).
Similarly, multiplying g r e e n and r o u n d should result in elements that roughly commute (with an extra space character): the full products of "green round" and "round green" should be similar.
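A minimal numpy sketch of the "characters as matrices" idea (random untrained matrices, purely to show the algebra): matrix products are order-sensitive, which is exactly what lets "cat bit dog" differ from "dog bit cat", while trained adjective matrices could in principle learn to approximately commute.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8  # each character gets a d x d matrix
    chars = {c: rng.standard_normal((d, d)) / np.sqrt(d)
             for c in "abcdefghijklmnopqrstuvwxyz "}

    def word_matrix(s):
        # "multiply the characters": left-to-right matrix product
        m = np.eye(d)
        for c in s:
            m = m @ chars[c]
        return m

    # order matters for matrix products, unlike for summed word vectors
    a = word_matrix("the cat bit the dog")
    b = word_matrix("the dog bit the cat")
    print(np.allclose(a, b))  # False

    # associativity: a phrase matrix is the product of its word matrices
    left = word_matrix("green") @ word_matrix(" ") @ word_matrix("round")
    print(np.allclose(left, word_matrix("green round")))  # True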
word2vec uses an inner product, supposedly between a context and a focus word vector. (From a physics perspective, the only way to get an invariant is by multiplying covariant with contravariant coordinates, so there's a hidden metric. Instead of adding the context and focus vectors after training to get the "embeddings", I believe providing a metric and using the same contravariant coordinates would make more sense; or alternatively, keep a unit metric and redistribute the parameters freed up from the context vectors to the focus word vectors.)
In word2vec this inner product is used to get a scalar (possibly negative), and a partition function is used to get the joint probability of a context and focus word.
How should the matrix or multivector be compressed to a scalar value to pass to the partition function? Perhaps a series of informed guesses could be made (determinant or trace of a matrix, norm of a multivector) and the better-performing one selected...
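One hedged way to wire that into a word2vec-style objective (my guess at a scoring rule, not anything from the literature): score a (context, focus) pair by the Frobenius inner product of their matrices and push it through a softmax partition function.

    import numpy as np

    def frobenius_score(ctx, focus):
        # trace(ctx.T @ focus) == sum of elementwise products (Frobenius inner product)
        return np.sum(ctx * focus)

    def pair_probability(ctx, focus, vocabulary_matrices):
        # partition function over the vocabulary, as in word2vec's softmax
        scores = np.array([frobenius_score(ctx, w) for w in vocabulary_matrices])
        z = np.sum(np.exp(scores - scores.max()))          # numerically stabilized
        return np.exp(frobenius_score(ctx, focus) - scores.max()) / z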
Or we could instead run through the corpus a whole number of sentences at a time, corrupt it (at first only one character at a time), and optimize the coefficients for an autoencoding loss.
Can you read this? 𝕆𝔹𝕍𝕀𝕆𝕌𝕊𝕃𝕐
When we read it we see the normal capitalized word "OBVIOUSLY", but somehow entirely emphasized in this mathematical doublestroke notation.
So the matrix for each doublestroke letter should be related to the matrix of the normal capitalized letter.
Suppose there was an invertible matrix M with inverse N such that MN = NM = 1, and each doublestroke letter's matrix were M times the plain letter's matrix times N; then the N's and M's cancel inside the word, and the doublestroke word's matrix is M times the plain word's matrix times N.
So this effectively allows such decorative words to form.
The same argument can be made for upper and lower case.
Not sure how much of what I wrote is compatible with HN's markdown...
So not only words but complete sentences would result in a "thought matrix" or "thought multivector".
Decoding such a thought multivector might be done by performing gradient descent on unit-sum interpolations of character matrices for each position in a string of such matrices.
I just discovered https://github.com/shahzainmehboob/word2matrices from about 5 years ago. There doesn't seem to be an associated paper, and it's not immediately clear what norm they use, but it looks like they flatten the matrix and compute the inner product on the resulting vector; by analogy with word2vec vectors, embeddings that occur frequently should thus be closer to the zero matrix. I.e. they seem to be using the Frobenius norm of matrices, which seems very reasonable.
They also only optimize for bigram statistics (2 matrices), so they don't utilize the associative property of matrices A(BC)=(AB)C, corresponding to string concatenation...
I used this channel and it was so frustrating it might as well not have existed.
While the Stripe team is always nice and the tech support is great, when I had a much less severe case of "Stripe wants to destroy my business" I could get no straight answer or help from the support team.
I built LightTag and eventually sold it. In my experience, at lower price points (<$100/month) it's a non-issue. If you can build a self-serve funnel, get people into it, and convert them, no one will care if you're 1 person or 100.
Once you go up in price point it becomes a bigger issue; I had many deals die when they realized I was a one-man show.
That shouldn't stop you, though. If you close a $50k recurring deal once a year, that still adds up to a great income.