This is really interesting and very cool to see. The approach we're taking with Eve (gory details at [1]) is that you can treat everything as relational, and doing so provides lots of benefits. One thing that wasn't clear, though, was how you extend that notion down to the OS level, for both performance and semantic reasons. It's encouraging to see someone with requirements as deep as Facebook's find that this strategy works in that context.
The next step would be manipulating the OS as relations. E.g. an insert into the process table allows you to actually spawn a process. It would start to get really interesting from there...
Right now osquery only supports read operations, but you're totally right: if you could kick off tasks, kill processes, unload kexts, etc. via INSERT, DELETE, etc. statements, that would be so killer!
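For what it's worth, here is a minimal sketch of what "writes as side effects" could look like. This is not osquery's API (osquery is read-only, as noted above); `ProcessTable` and its `insert`/`select`/`delete` methods are hypothetical names made up for illustration, built on Python's standard `subprocess` module.

```python
import subprocess
import sys


class ProcessTable:
    """Hypothetical writable 'processes' relation; not an osquery API."""

    def __init__(self):
        self._rows = {}  # pid -> subprocess.Popen

    def insert(self, argv):
        """INSERT: adding a row spawns the process and returns its pid."""
        proc = subprocess.Popen(argv)
        self._rows[proc.pid] = proc
        return proc.pid

    def select(self):
        """SELECT: one (pid, argv, running) tuple per tracked process."""
        return [(pid, p.args, p.poll() is None) for pid, p in self._rows.items()]

    def delete(self, pid):
        """DELETE: removing a row terminates the process."""
        proc = self._rows.pop(pid)
        proc.terminate()
        proc.wait()


if __name__ == "__main__":
    table = ProcessTable()
    pid = table.insert([sys.executable, "-c", "import time; time.sleep(30)"])
    print(table.select())
    table.delete(pid)
    print(table.select())
```

The table only knows about processes it spawned itself; a real version would have to reconcile with the kernel's view, which is where the hard semantic questions start.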
So in essence, you're motivated by the same underlying concept as the Plan 9/Inferno developers: define a small set of abstractions and apply them ruthlessly, in contrast to the myriad non-uniform interfaces one has to use every day in a typical operating system.
For Plan 9, it was "everything is a file", plus the power of simple operations like bind mounts to compose complex software interactions that would require monolithic protocol and library stacks anywhere else.
Here, it's... "everything is a table"? I'm not very familiar with table-oriented programming, but what advantages does having the RDBMS be the prime metaphor over the file system really bring? Structure? Rob Pike had some interesting words on that: http://slashdot.org/story/50858
"Everything is a file" means that you need to parse files to get useful data out (think things like /etc/mtab and /proc/mounts), which is closely tied to another UNIX philosophy, that tools should generate plain text and parse it using generic text-processing tools. This is great for getting things done quickly. It's also great for security holes (think CVE-2011-1749 and related issues; arguably, see also Shellshock).
One advantage of "everything is a table" is that your structures are well-formatted and there's no risk of problems when you put a space in a pathname. For most implementations of "table", you can also have the data formats be well-typed. This brings reliability and security benefits.
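A small sketch of the space-in-a-pathname problem, for anyone who hasn't been bitten by it. The `Mount` record and the deliberately naive `to_text`/`parse_text` helpers are made up for illustration; real /proc/mounts avoids this particular failure by escaping spaces as `\040`, which is itself one more thing every parser has to know about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mount:
    device: str
    mount_point: str
    fs_type: str


def to_text(m):
    # Naive whitespace-separated serialization with no escaping (illustrative).
    return f"{m.device} {m.mount_point} {m.fs_type}"


def parse_text(line):
    # Naive parser: assumes no field ever contains a space.
    device, mount_point, fs_type = line.split()[:3]
    return Mount(device, mount_point, fs_type)


original = Mount("/dev/sdb1", "/mnt/my disk", "ext4")
round_tripped = parse_text(to_text(original))

print(original)
print(round_tripped)
print(round_tripped == original)
```

The structured record keeps the mount point intact; the text round trip silently splits it and shifts the filesystem type into the wrong field.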
I think there's validity to Rob Pike's argument in many contexts -- for instance, you absolutely won't see me defending the semantic web over the greppable/Googleable one. But in the specific case of text files with a single, well-defined structure, his own argument seems to imply that there's no sense in a second tool having to infer the structure on its own.
(The usual way this is worked around these days is separate files for each field, or files designed to be parseable, which is why Linux's /proc/*/ is such a mess. Compare /proc/self/stat and /proc/self/status, and /proc/self/mounts and /proc/self/mountinfo. Also look around /sys a bit.)
There's a command-line tool called q, which allows performing SQL-like queries directly on text files, basically treating text as data and auto-detecting column types.
Neat, but auto-detection is exactly what I don't want. We have structure on one side. Why round-trip it through an unstructured format and attempt to guess the exact same structure on the other side? If I guess wrong, it's a security hole.
This is great. One of the frustrations I've had with Puppet and Ansible is the lack of a clear model for data. It's quite difficult to know the scope and dependencies and origin of all the variables that one deals with.
If one could update tables and then have that representation be reified to the machines it would be awesome.
>> The next step would be manipulating the OS as relations.
This approach could be a good fit for package management: packages would be updated and run within a transaction, and changes could be committed in a single seamless step.
Package managers usually have non-idempotent actions that might change parts of the operating system in undesired ways. That means you could not have atomic operations à la SQL. There is one package manager that solves that, Nix (from NixOS), on top of which you could apply something like an SQL language.
I'd rather see an OS and package manager that have a "functional" design (as in functional programming languages and functional data structures). This would allow conflicting packages to be installed next to each other in different branches of a functional filesystem.
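To show the all-or-nothing behaviour people are asking for, here is a sketch using Python's standard `sqlite3` module. The `packages` table, the `upgrade` and `versions` helpers, and the version strings are all hypothetical. It only covers the metadata: a database transaction does nothing about files already written to disk, which is exactly the non-idempotency problem described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY, version TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [("openssl", "1.0.1f"), ("bash", "4.3")],  # illustrative rows
)
conn.commit()


def upgrade(conn, changes):
    """Apply every (name, version) change, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for name, version in changes:
                cur = conn.execute(
                    "UPDATE packages SET version = ? WHERE name = ?",
                    (version, name),
                )
                if cur.rowcount != 1:
                    raise LookupError(f"unknown package: {name}")
        return True
    except LookupError:
        return False


def versions(conn):
    """Return the current {name: version} mapping."""
    return dict(conn.execute("SELECT name, version FROM packages"))


# The second change fails, so the first one is rolled back as well.
print(upgrade(conn, [("openssl", "1.0.1g"), ("no-such-package", "1.0")]))
print(versions(conn))
```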
The Nix package manager refers to itself as "The Purely Functional Package Manager", and that is exactly what it lets you do.
"It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments."
This seems interesting. But I find it a bit unsatisfying that it can only be used as a package manager then. How about using this functional machinery, e.g., for a general build system? A "functional make" so to speak. And I bet there are plenty of other use cases.
This would really require the underlying package management system to support this, and then it's simply a shim from osquery to the package manager to do the actual work. The main problem would probably be making it work across all the disparate systems it's supposed to support.
That said, it would be awesome to query specific package versions, or even individual package file MD5s from an SQL interface to check system exposure when new exploits come down the pipe.
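Roughly what that exposure check could look like, sketched with Python's `sqlite3` rather than osquery itself. The `installed_packages` and `advisories` tables, their columns, and every row in them are invented for illustration, and matching on an exact version string is a simplification (real advisories usually cover version ranges).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE installed_packages (host TEXT, name TEXT, version TEXT);
CREATE TABLE advisories (advisory TEXT, name TEXT, bad_version TEXT);
""")
conn.executemany("INSERT INTO installed_packages VALUES (?, ?, ?)", [
    ("web1", "examplelib", "1.2.0"),
    ("web2", "examplelib", "1.2.1"),
    ("db1", "otherlib", "0.9"),
])
conn.executemany("INSERT INTO advisories VALUES (?, ?, ?)", [
    ("EXAMPLE-0001", "examplelib", "1.2.0"),  # made-up advisory id
])

# Which hosts run a package version named in an advisory?
exposed = conn.execute("""
    SELECT p.host, p.name, p.version, a.advisory
    FROM installed_packages AS p
    JOIN advisories AS a
      ON a.name = p.name AND a.bad_version = p.version
    ORDER BY p.host
""").fetchall()

for row in exposed:
    print(row)
```

The appeal is that a new advisory becomes one more row, and the question "who is exposed?" stays the same join.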
While you're at it, one related piece that I'd like to see explored is relations-as-code, code-as-relations a la Lisp. It would be interesting to see what a program would look like if represented as tuples in a relation, and able to self-modify by updating that relation.
Abstract, I know, but I don't think anyone has looked at this in any detail yet.
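A toy sketch of the idea, just to make it less abstract. The program is a relation of `(line, op, arg)` tuples, and an `update` instruction rewrites a row of that same relation while it runs. The instruction set (`emit`, `update`, `goto`, `halt`) and the `run` function are invented for this example; it is not Eve, Datalog, or any existing language, and a Python dict keyed on `line` stands in for the relation.

```python
def run(program, max_steps=100):
    """Interpret a program stored as a relation of (line, op, arg) tuples."""
    rel = {line: (op, arg) for line, op, arg in program}  # keyed on line
    out, pc, steps = [], 0, 0
    while pc in rel and steps < max_steps:
        steps += 1
        op, arg = rel[pc]
        if op == "halt":
            break
        if op == "goto":
            pc = arg
            continue
        if op == "emit":
            out.append(arg)
        elif op == "update":
            # arg is itself a (line, op, arg) tuple: overwrite that row.
            line, new_op, new_arg = arg
            rel[line] = (new_op, new_arg)
        pc += 1
    return out


program = [
    (0, "emit", "first pass"),
    (1, "update", (0, "emit", "second pass")),  # rewrites row 0
    (2, "update", (2, "halt", None)),           # rewrites itself into a halt
    (3, "goto", 0),
]

print(run(program))
```

On the second trip through, row 0 emits the rewritten message and row 2 has turned itself into `halt`, so the run ends there.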
Datalog isn't what I was talking about. I would like to see a programming language where both the basic data type and the representation of code itself are relations, as is the case with Lisp and lists.
Thank you. I have misunderstood what Datalog is for years. I have always thought of it as a DSL for making queries against a database using the relational calculus. I had never considered, nor have the texts I've encountered demonstrated, how the syntax of Datalog is itself relational, though it seems obvious now that I've noticed.
I have some planned projects which require a homoiconic relational language. I was hoping someone else could be inspired to design such a thing so that I don't have to. It looks like someone already did. I am happy to be proven wrong :)
ibdknox, here's [1] an example of a relational scheme on top of HTTP APIs, maybe this could serve as inspiration for some aspect of your project. (BTW I feel like I keep bringing this [seemingly defunct] project up... oh well).
[1]: http://incidentalcomplexity.com/2014/10/16/retrospective/