Hacker News | past | comments | ask | show | jobs | submit | T-R's comments

Doing things in an IO monad, you don't distinguish much between types of effect - everything's just in IO, and each action executes as soon as you run into it. That means you don't have, e.g., the lookahead to see whether you could do one batch request instead of 10 individual ones. There have been a few attempts to address this:

Monad transformers allow you to separate types of effects (so you can specify, e.g., "this code only needs environment variables, not database access") and, at compile time at least, select a different implementation for each effect. In Haskell, though, they have the drawback of needing typeclass instances (interpreters) defined for every concrete monad stack - basically explicitly describing how the effects interact with each other, the n-squared instances problem. (In practice, there's a bunch of template code to help mitigate the boilerplate.)

Somewhat relatedly, Haxl, in an attempt to optimize effects, prompted a compiler change (ApplicativeDo) to identify less dynamic code - code that only needs Applicative - and Selective Functors were introduced to allow for more optimization based on what's coming next.

Algebraic Effects (assuming I'm not incorrect to conflate them a bit with free effects/extensible effects) make things more dynamic, so you're instead effectively building the AST you want, and separately interpreting it at runtime. This should let you look at more of the call tree to decide on an execution plan. Since you'd also not be relying solely on the typeclass mechanism to pick an interpretation strategy, you should also be able to more easily describe how the interpreters compose, saving you from all the boilerplate of the transformers approach.
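The "effects as data, interpreted separately" idea can be sketched in a few lines of Python. This is purely illustrative (no real effects library; `Fetch`, `interpret`, and the `value-of-` results are all made up for the example), but it shows the key payoff: because effects are values rather than already-executing actions, the interpreter can see the whole program and batch the requests.

```python
# Illustrative sketch: effects are plain data, so an interpreter can inspect
# the whole "program" before running anything and, e.g., collapse many
# individual fetches into one batched request.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fetch:
    # A description of an effect, not the effect itself.
    key: str

def interpret(effects):
    # Lookahead: deduplicate the keys and pretend to do one batched lookup
    # (a real interpreter would make a single round trip here).
    keys = sorted({e.key for e in effects if isinstance(e, Fetch)})
    batch = {k: f"value-of-{k}" for k in keys}
    return [batch[e.key] for e in effects]

program = [Fetch("a"), Fetch("b"), Fetch("a")]  # three requests described...
results = interpret(program)                    # ...one batch performed
```

Swapping in a different `interpret` (a mock, a tracer, a real client) is the analogue of choosing a different interpreter for the same effect AST.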


Great context, thanks


Data science does a lot of SQL-like and linear-algebra-like transformations over a lot of data, and needs it to be reasonably performant. This means you want to do things like minimize overhead of indexing into data, and use things like SIMD instructions/GPU or parallelize work. To do this, you generally want your data in column-major format - organized as objects of arrays, rather than arrays of objects. Dataframe libraries like Pandas (which uses optimized linear algebra libraries like BLAS/LAPACK under the hood, via numpy) and the Spark Dataframe API are for working with columnar data and getting performance via SIMD or parallelization, respectively.
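The "objects of arrays, rather than arrays of objects" distinction can be shown in plain Python (a tiny sketch with made-up purchase data; real dataframe libraries store each column as a contiguous typed array so this kind of per-column operation can be vectorized):

```python
# Row-oriented: a list of records (array of objects).
rows = [
    {"price": 9.99, "qty": 2},
    {"price": 4.50, "qty": 1},
    {"price": 2.00, "qty": 5},
]

# Column-oriented: one array per field (object of arrays) -- the layout
# dataframe libraries use, so an operation can sweep a whole column at once.
cols = {
    "price": [r["price"] for r in rows],
    "qty":   [r["qty"] for r in rows],
}

# A column-wise computation touches only the columns it needs:
revenue = [p * q for p, q in zip(cols["price"], cols["qty"])]
```

With contiguous numeric arrays (numpy, Arrow) the last line becomes a single vectorized multiply instead of a Python loop.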

Generally people start off by doing these computations in a series of batch jobs (an "ETL pipeline", orchestrated with something like Airflow), to transform data into whatever shape they ultimately want it in; streaming technologies like Spark Streaming and Kafka can help with incrementally adding new rows to your data, rather than recomputing the whole thing every batch-job run.

Whenever you want to involve multiple systems or multiple libraries in your dataframe transformations, there's potentially a lot of computational overhead in serializing the dataframes or just converting them between memory representations. Arrow is a standardized columnar format, spearheaded by Wes McKinney (the creator of Pandas), that specifies the in-memory representation itself, so that whether you're passing data between libraries in-memory or writing a file for some other system to read, no unnecessary conversions need to happen before you can work on the data.


> linear-algebra-like transformations

> To do this, you generally want your data in column-major format

I'd argue that the basic element of linear algebra is matrix vector multiplication, which I figured was best done row-major. Column major is great in other data use cases, but 'linear-algebra-like, therefore column major' doesn't feel right.


I don't know about linear algebra, but column major lets you compress thus:

* Dictionary encoding: US,US,US,US,FR -> US:0,FR:1;0,0,0,0,1

* Run-length encoding: 0,0,0,0,1 -> 4x0,1x1

* Delta encoding: 0,1,2,3,4 -> 5x'+1'

* Storing the min and max for a chunk

Basically: exploit the data type to compress it.

Which enables very fast filtering and projections. (And now that the IO bottleneck has been managed you can do your gigantic logistic regression)
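The first two encodings in the list above can be sketched in a few lines of Python (illustrative toy code, not a real codec; column stores like Parquet apply these per column chunk):

```python
def dictionary_encode(values):
    # US,US,US,US,FR -> dictionary {US: 0, FR: 1} plus code stream 0,0,0,0,1
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def run_length_encode(codes):
    # 0,0,0,0,1 -> [(4, 0), (1, 1)], i.e. 4x0, 1x1
    runs = []
    for c in codes:
        if runs and runs[-1][1] == c:
            runs[-1] = (runs[-1][0] + 1, c)
        else:
            runs.append((1, c))
    return runs

d, codes = dictionary_encode(["US", "US", "US", "US", "FR"])
runs = run_length_encode(codes)
```

Filtering on `country == "US"` can then work directly on the run-length-encoded codes without ever materializing the strings.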


It sounds like you're thinking about the mat-vec operation in terms of "grab one row of the matrix, take the dot product with the vector, and repeat for each row of the matrix."

But it's also possible to think of it as "grab one element of the vector, use it to scale the corresponding column of the matrix, and repeat, summing the results." Both are efficient ways of computing the result, and both have block-level versions that play nicely with the machine's cache.
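Both viewpoints can be written out in plain Python (a tiny illustrative example; real BLAS implementations use blocked, SIMD-friendly versions of exactly these two loop orders):

```python
A = [[1, 2],
     [3, 4]]
x = [10, 100]

# Row view: dot product of each row of A with x.
by_rows = [sum(a * b for a, b in zip(row, x)) for row in A]

# Column view: scale column j of A by x[j], accumulating into the result.
by_cols = [0] * len(A)
for j, xj in enumerate(x):
    for i in range(len(A)):
        by_cols[i] += A[i][j] * xj

# Same answer either way; only the memory access pattern differs.
assert by_rows == by_cols
```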

Meanwhile, linear algebra also often involves finding vector norms, and scaling vectors, and so on, and the way we usually set up tables means that the vectors of interest are generally columns of the data tables.


This is what I was trying to get at - using column vectors gives good cache locality and lets you use SIMD for "multiply all of these by this scalar" for each column, and then for "sum all of these" for the resulting rows. I'd imagine it could also let you optimize multiplications into things like bit-shifts with minimal overhead as well, though I have no idea if that's done in practice. Maybe only tangentially related, but I feel like this talk on Halide[0] is really illustrative of the general concepts.

As others have mentioned, for some operations it can also save you from loading whole columns that aren't relevant for your transformation. The compression point in the sibling comment is definitely also relevant, especially for serialization. A whole lot of reasons to use column vectors.

Using "column-major" here might've been terminology abuse; sorry for the confusion.

[0] https://www.youtube.com/watch?v=3uiEyEKji0M


"column" here refers to a type of data. let's say you have a bunch of records of purchases. one column would be price, another column would be quantity.

if you're doing a linear algebra like transformation, you want to do it on all the prices or all the quantities, and a linear algebra library expects a big array of numbers, which is why you have to transform your records into an array of prices and an array of quantities.

"column" here refers to properties of objects, and not rows vs columns with in an array of number



They also occasionally contribute libraries, like Haxl [0] and HsThrift [1].

[0] https://hackage.haskell.org/package/haxl

[1] https://engineering.fb.com/2021/02/05/open-source/hsthrift/


I have one of these; the complaint about the fan being inaccessible is legitimate, but it's otherwise clean, and easy to clean - the tank and the base can be scrubbed by hand or just thrown in the dishwasher, unlike any vaporizing humidifier I've used. I think the other complaints may be regional - and the fact that the proposed solutions all seem to be vaporizing humidifiers (or boiling water) seems telling.

I got one because in Arizona the water quality's terrible, but the air is bone-dry to the point it turns your skin to sandpaper. The hard water means that vaporizing humidifiers fill the air with white dust that coats everything, because the minerals go out with the mist. Evaporative humidifiers don't; the minerals all end up in the filter. And, while an evaporative humidifier has no trouble going through a full tank of water overnight on the lowest settings in AZ, it physically can't oversaturate the air like a sauna the way a vaporizing humidifier can.

The air in the north east just doesn't get dry enough (maybe in winter, but then you don't have the AC fighting your humidifier), and if boiling a pot of water is what the author's looking for, an evaporative humidifier just doesn't do that.


I have one too and cleaning the tank was never an issue. Instead, my problem was the wick always got moldy on me within a couple weeks. I went through a couple cycles of buying new wicks from them, but it kept happening and I gave up on it. Went back to my old ultrasonic one, which isn’t great, but it doesn’t get black mold and is better than nothing.


Same experience here. My conclusion was that the Honeywell model is just a vehicle to sell more filters, which are not that cheap in the long run.

Tired of the moldy mess, we too now run an ultrasonic mister. I can't say it makes a big difference, but it is quiet. We also routinely air out the bedroom with the window open and the ceiling fan spinning before bedtime.

Ah, we also had to cover up the obnoxiously bright blue light on our humidifier. Someone thought it made for a nice-looking design, but a blue glow is hardly conducive to sleeping.


> My conclusion was that the Honeywell model is just a vehicle to sell more filters, which are not that cheap in the long run.

This is every air quality device in the industry. I have an old Vornado air purifier that uses furnace filters for its media: $30 for the allergen-level filters, when I can find them. When the electronics crap out in that thing is when I'll learn to repair fans.

I had a prototype I built before I discovered these that used a 12V Molex power supply brick and a bunch of case fans to draw air through a furnace filter. I had it in a triangular box, intending to put it under a bed or a chair so it was dead silent. Then I found the Vornado and that went into storage. Higher cfm.

Unfortunately the Vornado filter was built at the height of the blue LED craze. But at least it’s low lumens.


> This is every air quality device in the industry.

I haven't had much exposure to many. Before the Honeywell we had the "boiler" kind of humidifier, the one that uses two graphite electrodes submerged in the water tank. Not many consumables, except that it doesn't do much unless you keep it right at your nose (as pictured on the box :). The electrodes eventually develop some grime, which further decreases 'efficiency', but it can be cleaned off somewhat.

As for the ultrasonic one we use now, there don't seem to be any consumables as such. Maybe the ultrasonic element itself; in three years of seasonal use I've noticed either some corrosion or some buildup on the black ring of the element, but not much difference beyond that.

There is some kind of filter ring, made of hard plastic and filled with small beads, but as I said, in the three years we've had it there's been no reason to change it. Clean it, yes, but not replace it like with the Honeywell.

Cleaning the thing is a drag, as always. Two usual problems: the reddish mold above the waterline and inside the "gorge" and nozzle, and also limescale, though a moderate amount. We fill it with plain cold tap water, which isn't too bad here.


How do you deal with cleaning the handle part of the tank? It looks like the end can pop off, but I haven't tried.


I don't have a problem getting my fingers down everywhere but the corners of the handle, so just pushing a paper towel down there works fine; I wouldn't be surprised if people with bigger hands than mine have more trouble, though. In practice, I've mostly just been throwing it in the dishwasher.


I use a kitchen sponge to get around the corner and it's been quite clean for me.


> Comparing floats for equality is not a good idea

This is very true, but the mention of data science is really significant here.

If you're trying to productionize a neural network or something similar, you ideally want to be able to pin down every source of noise to make sure your results are reproducible, so you can evaluate whether a change in your results reflects a change in the data, a change in the code, or just luck from your initial weights - with a big enough network, for all you know it decided that the 10th decimal place of some feature was predictive. I wouldn't be surprised if the author brings up tanh as an example specifically because it's a common activation function.

If you're pinning all your Javascript dependencies, but the native code it dynamically links to gets ripped out from under you, it could completely mislead you about what produces good results. Similarly, if you want to be able to ship a model to run client-side, it would be nice if you could just test it on Node with your Javascript dependencies pinned and be reasonably confident it'll run the same on the browser.

Of course, if you can't do that it's not the end of the world, since you can compensate by doing enough runs, and tests on the target platform, to convince yourself that it's consistent, but it's a lot nicer if things are just as deterministic as possible.
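For the quoted caveat itself, a minimal illustration (the standard example, in Python): exact equality on floats fails even for "simple" arithmetic, which is why comparisons should use a tolerance.

```python
import math

# The classic float surprise: 0.1, 0.2, and 0.3 aren't exactly
# representable in binary, so the sum doesn't compare equal.
a = 0.1 + 0.2
assert a != 0.3

# Comparing with a relative tolerance instead behaves as intended.
assert math.isclose(a, 0.3, rel_tol=1e-9)
```

For reproducibility work the point is subtler: you want the *same* inexact bits every run, so that a changed result can be attributed to data or code rather than to the platform's libm.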


I'm not a data scientist, but

> for all you know it decided that the 10th decimal place of some feature was predictive

isn't this kind of a misfeature of the model? If it really depends that highly on a function output that's not guaranteed to be precise even in the best of cases and certainly not when you have to assume noisy input data... that doesn't seem to be particularly robust.


If you want true reproducibility you can save everything in an image/container. Lots of data science companies offer this for end-to-end reproducibility.


It's for defining types for existing formats, with the goal of generating types for a lot of different languages.

A lot of formats are just defined with very simple types; the machine-readable descriptions don't capture information like "this field could contain a string or an integer", or "the length of this array is dependent on the value in another field". Idris' type system is precise enough to capture that information, so specifying the type in Idris' type system should be enough to generate (at least a first-pass of) types for anything from Java to Haskell by just dropping the information the target language doesn't understand.

At the very least, it would make for some really precise, concise documentation that could save people implementing libraries a lot of time digging through prose.
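As a rough illustration of the kind of invariant such a description can capture (sketched in Python with a made-up `check_record` helper, since most target languages can only enforce this at runtime): "the length of this array is dependent on the value in another field."

```python
def check_record(rec):
    # Hypothetical invariant from a format spec: field `n` declares how
    # many entries the `items` array must contain.
    return isinstance(rec.get("n"), int) and len(rec.get("items", [])) == rec["n"]

assert check_record({"n": 3, "items": [1, 2, 3]})
assert not check_record({"n": 2, "items": [1, 2, 3]})
```

A dependently typed description could rule the second record out statically; generated Java or Haskell would instead get a validation check like this.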

Edit: Looking at the examples, it's actually value-level Idris describing algebraic data types, not Idris types themselves, so I'm not sure whether it supports dependent quantification.


> the machine-readable descriptions don't capture information like "this field could contain a string or an integer", or "the length of this array is dependent on the value in another field"

So the auto-generated code can implement correctness checks on the data, presumably at runtime? Sounds neat. Reminds me of Ada's discriminated types.

https://www.adaic.org/resources/add_content/standards/05rat/...


I don't think it's a bad thing. Like them or not, they have a somewhat controversial history; a lot of people worried that their projects would fork the ecosystem, or give them too much control of it - I wouldn't be surprised if they decided that it would be better for the foundation's goal of being a community effort to not be seen as having a heavy hand in it. Some people have expressed similar concerns about Facebook and IOHK as well, just from the financial influence they might have, even though their haskell projects haven't been as close to critical infrastructure as FPComplete's have.


Kanji are composed from a much smaller set of radicals, so you look it up by the radicals that compose it. Some dictionaries also divide characters up by stroke count. For the past 15+ years, electronic dictionaries with handwriting recognition have been popular, and now you can just use your phone keyboard.

In practice, sometimes you can also kind of guess at the pronunciation (since characters with the same main radical often have a similar pronunciation) and see what autocomplete lists; maybe that's less true outside of the joyo kanji, though.


Thank you so much for this answer.


While I don't think Japan should be judged solely on its fax machines (they're ahead in some places and behind in others), there is a difference with fax machine usage in Japan vs the US - they're a trailing indicator of another bottleneck. Fax machines are so widespread in Japan largely because they ease the process of physically stamping documents with a registered stamp (a hanko), which is used instead of a signature. So far they've failed to digitize them, and it's often cited as a huge cause of bureaucratic slowdown, especially in the pandemic. One of the goals of the new prime minister is to finally start phasing them out.


> So far they've failed to digitize them, and it's often cited as a huge cause of bureaucratic slowdown, especially in the pandemic.

Compare this to the US where Bill Clinton signed the Electronic Signatures Act[1] back in 2000.

[1] https://en.wikipedia.org/wiki/Electronic_Signatures_in_Globa...


My understanding is that the current implementation is only groundwork for possible future performance benefits, since there are no changes to the runtime system or garbage collector, and that their current intended use is mostly for allowing library authors to encode safe session/resource handling in the type system. So, new libraries can make it a type error to fail to properly close a connection or release a shared resource.

Simon Peyton Jones talk: https://www.youtube.com/watch?v=t0mhvd3-60Y


You can write safe memory management libraries in userspace with the existing extension. That can help reduce GC pressure by moving data off-heap and allow for safer & more efficient FFI bindings.

As someone who has been exploring gamedev in Haskell, all of that is very exciting, and the use cases are readily apparent to me.

It is entirely up to userspace to leverage this new feature. It doesn't do anything on its own outside of the type-checker. And to use it, your library will probably have to do some unsafe coercion under the hood (just like ST uses unsafePerformIO under the hood to provide pure mutability).


Right, my understanding is that in a sense it doesn't "enable new performance", because operationally speaking it's all stuff you could do anyway. What it does is move a lot of potential approaches from "absurdly unsafe" to "perfectly fine", so whatever your threshold for safety it's likely to move some speed improvements over the line if you care much about speed.

