
Nice work. Have you blogged about how you built it?


What I love about duckdb:

-- Support for .parquet, .json, .csv (note: Spotify listening history comes as multiple .json files, something fun to play with).

-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.

-- if the files don't have the same schema, union_by_name is amazing (quick sketch at the end of this list).

-- The .csv parser is amazing. It auto-detects column types well.

-- It's small! The WebAssembly version is 2 MB! The CLI is 16 MB.

-- Because it is small, you can add DuckDB directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
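
A minimal sketch of the glob + union_by_name combo (reusing the glob from above; everything else hypothetical):

    -- read hundreds of monthly files as one table, tolerating
    -- columns that appear or disappear across files
    SELECT *
    FROM read_csv('tsa20*.csv', union_by_name = true);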


>> The .csv parser is amazing

Their CSV support, coupled with lots of functions and fast & easy iterative data discovery, has totally changed how I approach investigation problems. I used to spend a significant amount of time upfront understanding the underlying schema of the problem space, and often there really wasn't one - but you didn't find that out easily. Now I start by pulling in data, writing exploratory queries to validate my assumptions, then cleaning & transforming the data and creating new tables from that state; rinse and repeat. Aside from getting much deeper much quicker, you also hit dead ends sooner, saving a lot of otherwise wasted time.
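
In DuckDB terms that loop looks roughly like this (file and column names made up):

    -- 1. pull the data in and sanity-check the inferred schema
    DESCRIBE SELECT * FROM 'events.csv';

    -- 2. per-column stats (nulls, min/max, distinct counts) to test assumptions
    SUMMARIZE SELECT * FROM 'events.csv';

    -- 3. clean & transform into a new table, then rinse and repeat
    CREATE TABLE events_clean AS
        SELECT * FROM 'events.csv' WHERE ts IS NOT NULL;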

There's an interesting paper out there on how the CSV parser works, and some ideas for future enhancements. I couldn't seem to find it but maybe someone else can?



Been playing around with ClickHouse a lot recently and have had a great experience, particularly because it hits many of these same points. In my case the "local files" angle hasn't been a huge factor, but the Parquet and JSON ingestion have been very convenient, and I think CH intends for `clickhouse-local` to be some sort of analog to the "add duckdb" point.

One of my favorite features is `SELECT ... FROM s3Cluster('<ch cluster>', 'https://...<s3 url>.../data/*/*.json', ..., 'JSON')`[0], which lets you wildcard-ingest from an S3 bucket and distributes the processing across the nodes in your configured cluster. Also, I think it works with `schema_inference_mode` (mentioned below), though I haven't tried it. Very cool time for databases / DB tooling.

(I actually wasn't familiar with `union_by_name`, but it looks like ClickHouse has implemented that as well [1,2]. Neat feature in either case!)
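
Untested, but going by the docs the union mode would look something like this (bucket URL hypothetical):

    -- merge schemas across files instead of requiring identical columns
    SELECT *
    FROM s3('https://my-bucket.s3.amazonaws.com/data/*.json', 'JSONEachRow')
    SETTINGS schema_inference_mode = 'union';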

[0] https://clickhouse.com/docs/sql-reference/table-functions/s3... [1] https://clickhouse.com/docs/interfaces/schema-inference [2] https://github.com/ClickHouse/ClickHouse/pull/55892


Malloy and PRQL (https://prql-lang.org/book/) are quite cool


I built Shaper following Malloy's idea of combining data queries and visualizations, but Shaper uses SQL instead of a custom language. It turns DuckDB into a dashboard builder for when all you need is SQL.

https://github.com/taleshape-com/shaper


I built https://zenquery.app and use DuckDB internally for all processing. The speed is crazy, schema auto-detection works correctly (most of the time), and LLMs generate correct SQL for queries given in plain English.


This is a great sell. I have this annoyingly manual approach with a SQLite import and so on. This is great. Thank you!


Thanks for the excellent comment! Now excuse me while I go export my spotify history to play around with duckdb <3


Spotify says it will take 30 days for the export... it really only takes about 48 hours, if I remember correctly. While you wait for the download, here is an example listening-history exploration in Malloy - I converted the listening history to .parquet: https://github.com/mrtimo/spotify-listening-history
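
The conversion itself is roughly a one-liner in DuckDB (file names from memory, yours may differ):

    COPY (SELECT * FROM 'StreamingHistory*.json')
        TO 'listening_history.parquet' (FORMAT parquet);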


Polars has all of these benefits (to some degree), but also allows for larger-than-memory datasets.


DuckDB supports this as well, and depending on which benchmark you look at, it regularly performs better on those datasets than Polars.


-- hive partitioning
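
e.g. a layout like data/year=2024/month=06/file.parquet exposes the partition values as queryable columns. A sketch, with hypothetical paths:

    SELECT year, month, count(*) AS n
    FROM read_parquet('data/*/*/*.parquet', hive_partitioning = true)
    GROUP BY ALL;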


It's 32 MB uncompressed and around 6 MB compressed, so it's not that small: https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm/dist/

It is also harder to customize than SQLite; if you want to use your own CSV parser, for example, that becomes difficult.

But yes, it provides a lot of convenience out of the box, as you have already listed.


Nice work Andy. I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.). Something to consider for the future. Thanks.


> I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.)

We also hosted Lloyd to give a talk about Malloy in March 2025:

https://db.cs.cmu.edu/events/sql-death-malloy-a-modern-open-...


Me too


I'm using DuckDB WASM on GitHub Pages. The site takes about 10 seconds to load [1] and shows business trends in my county (Spokane County). It is built using data-explorer [2], which uses many other open-source projects, including malloy and malloy-explorer. One cool thing... if you use the UI to make a query on the data, you can share the URL with someone and they will see the same result / query (it's all embedded in the URL).

[1] - https://mrtimo.github.io/spokane-co-biz/#/model/businesses/e... [2] - https://github.com/aszenz/data-explorer


DuckDB can read JSON - you can query JSON with normal SQL.[1] I prefer the Malloy data language for querying, as it is 10x simpler than SQL.[2]
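
A minimal sketch against a Spotify-style export (file and field names may vary by export version):

    SELECT artistName, count(*) AS plays
    FROM read_json_auto('StreamingHistory0.json')
    GROUP BY artistName
    ORDER BY plays DESC
    LIMIT 10;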

[1] - https://duckdb.org/docs/stable/data/json/overview [2] - https://www.malloydata.dev/


So can Postgres. I tend to just use PG, since I have instances running basically everywhere, even locally, but DuckDB works well too.


I have experience with DuckDB but not Databricks... From the perspective of a company, is a tool like Databricks more "secure" than DuckDB? If my company adopts DuckDB as a data lake, how do we secure it?


DuckDB can run as a local instance that points to Parquet files in an S3 bucket. So your "auth" can live at the layer that grants permission to access that bucket.
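
A rough sketch of that setup (hypothetical bucket; assumes the httpfs extension and the secrets manager from recent DuckDB versions):

    INSTALL httpfs;
    LOAD httpfs;
    -- pick up credentials from the standard AWS chain (env vars, ~/.aws, IAM role)
    CREATE SECRET lake_s3 (TYPE s3, PROVIDER credential_chain);
    SELECT count(*) FROM read_parquet('s3://my-data-lake/events/*.parquet');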


Love this! Here is a similar product: https://sql-workbench.com/


Based on this comment, you might enjoy the Malloy data language. It compiles to SQL, and there is also an open-source explorer that makes filters like the ones you describe easy.


Thanks for the tip. I am checking it out right now.


It’s 2025. Let’s separate storage from processing. SQLite showed how elegant embedded databases can be, but the real win is formats like Parquet: boring, durable storage you can read with any engine. Storage stays simple, compute stays swappable. That’s the future.


Counterpoint: "The two versions of Parquet" https://qht.co/item?id=44970769 (17 days ago, 50 comments)


From what I understood of the short description, Parquet is a column-oriented format: it is built for selecting/reading data and is difficult to use for updates (like Yandex ClickHouse).

