Containers in 2019: They're Calling It a Hypervisor Comeback (infoq.com)
201 points by ingve on Oct 26, 2019 | 192 comments


I've had this sneaking but hard to articulate suspicion that datacenters, bare metal servers, VMs, operating systems, containers, OS processes, language VMs, and threads are all really attempts to abstract the same thing.

You want to run business code in a way that's protected from other business code but also able to interact with other business code and data in a well defined way.

I also have this sneaking suspicion that new generations are re-inventing the wheel in a lot of ways. If you have minimal containers running on a hypervisor how is that different from processes running on an operating system? You have all these CPU provided virtualization instructions to protect guests from each other and the host from the guests but there's no reason those instructions couldn't have been developed to protect processes from each other. You have indirections to protect one guest from accessing another's storage but there's no reason processes couldn't have the same protections (and they do in many operating systems). Why container orchestration and overlay networks instead of OS scheduling and IPC?

I'm sure people in academic computer science have already published many a paper about this, but it feels different seeing it from inside an IT organization, where people don't seem to apply the lessons from older technologies to newer ones. We end up in this constant churn of reinvention which, as far as I can tell, is mostly a way for people, both in management and in the trenches, to keep their jobs, at least until you're "too old to learn new things" and pushed aside.


> I've had this sneaking but hard to articulate suspicion that datacenters, bare metal servers, VMs, operating systems, containers, OS processes, language VMs, and threads are all really attempts to abstract the same thing.

There is a very easy way to articulate it: they are all ways of virtualizing different facilities. Unix processes virtualize the user-mode processor registers and the address space. POSIX threads only virtualize the processor registers and the stack. Linux namespaces let you pick and choose which facilities you want to virtualize (storage, PID namespace, IPC namespace, etc.).
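The process/thread distinction can be seen directly. A small sketch in Python (assuming a Unix system, since it uses `os.fork`): a thread's write to the heap is visible to the parent, while a forked process writes only to its own copy of the address space.

```python
import os
import threading

counter = [0]

def bump():
    counter[0] += 1

# A thread virtualizes only the registers and a stack; the heap is
# shared, so the parent sees the thread's write.
t = threading.Thread(target=bump)
t.start()
t.join()
print(counter[0])  # 1

# A forked process gets its own (copy-on-write) address space; its
# write lands in the child's private copy, not the parent's.
pid = os.fork()
if pid == 0:          # child
    bump()            # increments the child's copy to 2
    os._exit(0)       # exit without touching the parent
os.waitpid(pid, 0)
print(counter[0])  # still 1
```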

The abstraction is related to what is being virtualized: everything that is virtualized is a namespace (this is what naming things refers to in Phil Karlton's quote "There are only two hard things in Computer Science: cache invalidation and naming things," not to the literal naming of variables in computer programs).

Another way to go instead of virtualization is capabilities.


> this is what naming things refers to in Phil Karlton's quote "There are only two hard things in Computer Science: cache invalidation and naming things," not to the literal naming of variables in computer programs

That is mighty interesting. Do you happen to have a source for this? It's the first time I've heard it being put in this particular way.


https://skeptics.stackexchange.com/questions/19836/has-phil-...

Several references to the quote "There are only two hard things in Computer Science: cache invalidation and naming things" can be found, such as on Martin Fowler's blog and elsewhere.

I'm unable to find the source of this quote; has he ever said it?

- As his only son, and colleague with him at Netscape from 95-97, I can attest that my dad did indeed throw that quote around, on more than one occasion. I'm fairly confident that he originated it (he was fond of coming up with clever quippets), though I haven't been able to figure out how it disseminated so widely over the past couple of decades. I'll keep looking around in old web archives and mails to see if I can dig something up.


The version I've heard is "There are only two hard things in Computer Science: cache invalidation, naming things, and off by one errors".


There are only two hard things in Computer Science:

0. cache invalidation

1. naming things

47. asynchronous callbacks

2. off by one errors


Zero-based indexing.


Because the current model of processes frankly sucks. If I give you a random binary, would you run it? You can talk about sandboxing and lecture about permissions and the principle of least privilege, but that still doesn't answer the question: can it run hostile code without side effects on the rest of the system? Other than the newer web tech initiatives like WebAssembly/JS sandboxing, there is no other technology that allows easy compartmentalization with somewhat reasonable security isolation out of the box for existing server code. Containers are not perfect, and hypervisors/unikernels are a pain to deploy on commodity cloud services, but we are slowly getting there. (Also, if you notice, the newer generation of operating systems like Android, iOS, and Fuchsia are moving towards strong sandboxing, with every single thing locked down where possible. There is an impedance mismatch between the Unix model and current demands. You can't migrate overnight, so you build technologies to work around it. From far away, Windows and Fuchsia are very similar, but Windows has to contend with decades of code-compatibility needs even if they want to start shoving everything into a locked-down app.)


Deploying a hypervisor on top of a hypervisor (eg: a public cloud) usually makes very little sense, so I don't see that as a trend. However, outside of the valley bubble, many large sectors of the economy (such as banking, health, defense) do not use public clouds, so hypervisors, typically in the form of ESX, are still very much a thing, and now that we are seeing tighter integration with things like unikernels, that is where the 'comeback' is coming from. I would almost go as far as to say that it's becoming hip not to be using public cloud today.

Deploying a unikernel on top of commodity cloud is very achievable and easy today - I know because I'm involved with https://ops.city . I'd argue it's ridiculously easier than trying to deploy a container, clearly way safer and even faster and easier than using something like terraform.

I do very much agree with your impedance-mismatch point re: the Unix model though.


> However, outside of the valley bubble many large sectors of the economy do not use public clouds (such as banking, health, defense)

I work in traditional finance, outside of Silicon Valley, and there is a huge drive to move to the public cloud. For the same reason banks do not own their office buildings anymore: managing computing infrastructure is not their core business and, to be honest, they are not very good at it. I know people who have to move servers and applications to a different data centre because the lease on an old DC is ending; it is a huge headache, and they would much rather concentrate on their core business.


I don't know who you specifically work for, but I've been to 2 different banking tech conferences in the past month. JPMorgan mentioned they have 13 (or was it 16?) wholly owned datacenters. BoA's tech summit was just this past week in Palo Alto, and they were highly negative about anything public cloud; it is an extremely rare thing for them to put anything in a public cloud. A point they stressed throughout the summit is that the main thing they sell is 'trust' (eg: by holding people's money), and that is quite simply not something they are willing to delegate to others. That's the main driver for them, but of course all the other things like cost, security, regulatory, and data gravity are drivers as well.


Since you cite JPMC, read this Sept 2019 interview with CEO Jamie Dimon:

https://www.geekwire.com/2019/interview-jpmorgan-chase-ceo-j...

“Jamie Dimon: The cloud and AI are real, huge, powerful and they’re going to change a lot for us. Of course, good companies are going to adapt to it in due course. They can’t just do it overnight, but it changes a lot more than just using the cloud — how you write your code, how you can do agile, it even changes how you organize your company. Managers start to think about using data in a far different way than they ever did before. It changes everything. It’s a huge opportunity.”

...

“When cloud first got started, there was no reason to use someone else’s data center or their networks. You know, I wasn’t completely wrong, if I can run my data center as efficiently as a cloud guy, which fundamentally is true.”

”But here’s what I missed: the bursting part of the cloud. It reduces the peak capacity that we’ve got to keep in our own data centers, and you can burst excess workloads into the cloud. And it directly relates to AI. There’s just so much compute power in one split second that I simply can’t do in one of my data centers. So if you want to apply AI and these tools, and also the services that are built into some of these platforms [the cloud offers an advantage.]”

// Disclaimer: I work at a global bank.


Hypervisor on hypervisor makes sense if you care about security. Your average FaaS-on-Kubernetes hosting service not from one of the tech giants is a far cry from AWS Firecracker (Lambda) levels of isolation. E.g. you run a gaming server and want to allow arbitrary user plugins, or perhaps you're a challenger fintech bank that wants some simple user scripting. You want to allow Python/Ruby/any lang except Lua/JS/WASM. You won't be able to find anything hosted that doesn't cost an arm and a leg. (On an unrelated note, if anyone from Zeit is reading this, can you comment on why now.sh container support was deprecated in V2? Was it because of cost?)


I totally agree that if you care about security you should be running in the opposite direction of k8s.

Google Cloud has this option (nested hypervisors), but not for security reasons: it's used to run other software that comes packaged as a VM. That comes with a serious perf hit, which is why I don't see it ever catching on as a trend. Selling bare metal servers w/pre-installed hosted Firecracker would make much more sense (if you want/need hosting).

If you are interested in isolating a particular application to a given interpreter or a given binary the aforementioned unikernels are precisely what you are looking for and those can be provisioned on t2.micros.


> I totally agree that if you care about security you should be running in the opposite direction of k8s.

The parts of k8s that are sticky and hard to change are the API abstractions like Services, Deployments, and Nodes. That's also where a lot of the value add is, like AWS CloudFormation. The only Docker-specific part is the Docker images.

I don't think the sandboxing techniques (i.e. Docker) will stay the same for long. There are already multiple initiatives to implement hypervisor-based pods. I think Firecracker has a lot of inertia right now, and I'm pretty sure rkt containers with KVM isolation were in beta a couple of years ago.

The point is, I don't see how kubernetes has all that much to do with security (at least as far as isolation goes). Docker alternatives will gain traction and k8s will offer them as pod runtimes. Maybe the direction you mean we should be running is the opposite one from Docker?


Your analysis is on point and brilliant. I would upvote this more if I could. Not enough people on HN understand the weaknesses of Kubernetes and existing self-hosted cloud software (and, more importantly, what hypervisors bring to the table).


Wanted to share my first experience with the ops.city site. This is unsolicited, but I do hope it helps your team succeed:

I opened the 'Why' tab and found blurbs I wanted to see more about. For example, security through a smaller code/attack surface, 'no ops' (I didn't grok the bullets here, but I was interested to know more), and higher performance are all interesting.

The bullets were interesting enough for me to want more, so I opened the video.

Unfortunately, the video was really vanilla and didn't expand on these concepts at all. It shows how easy ops.city is, but gives me nothing more on why it is better than ten other technologies that have similar 'easy intro' videos.

What I was hoping for was a 5-10 minute video focused on benefits of your approach vs alternatives. For example, some talk or demo on improved performance, data on how much code/attack surface is reduced, comments on OS/kernel memory footprint (this is a problem for us - clients like using 500MB "tiny" VMs on public clouds that hang running cadvisor), etc.

I could leave your site to answer these questions on my own. I did this and found some promising-looking links over on the nanovm website. But that is an awkward experience: leaving your site to try to understand your offering.

Perhaps I'm not your target audience. For context, I do a minor amount of devops as part of delivering initial R&D products to multiple clients. We are not ops experts, but we try to deliver future proof work, which involves us staying apprised of new approaches and making them available to clients when they are a fit.


> From far away, Windows and Fuchsia are very similar, but Windows has to contend with decades of code-compatibility needs even if they want to start shoving everything into a locked-down app

DOS and such were designed around everything on the device being shared, with conflicts resolved at the application layer. Windows was built to maintain some (even ideological) compatibility with that world, while adding some of the benefits of separation and sharing (for example, moving away from the world where every piece of a job could crash/co-opt/exfiltrate the entire device).

Unix (Multics, CTSS, ITSS, etc) was built to enable a world where multiple jobs (not processes; even then lots of jobs were multi-process) and multiple people could share the device. Threads, jails, cgroups, etc were added to move to a world where sharing wasn’t entirely cooperative and trusting. VMs, containers, and hypervisors are on the recent end of that same movement, along with wasm and JS sandboxing.

In the end, it’s a balancing act between performance (speed, power, cost) versus safety, often starting from different points. Exo-, uni-, and library-kernels are similar efforts that haven’t (yet?) caught on, but there’s a pretty clear direction of movement towards the strongest isolation that our (currently quite flawed) hardware can afford.


This comment made me remember my first Mac where when you launched an app it was given complete control over the hardware, with ASM calls to the Mac Toolbox.

Remembering these times makes me wonder why there should be only one direction, and whether maybe there is an alternative to putting all our most sensitive data on every internet-connected device. In the old days, when you had a computer for some hobby like games, photo editing, music, etc., the data was only related to that activity, and in the worst case of nuking the hard drive you lost only that work.


> Other than the newer web tech initiatives like WebAssembly/JS sandboxing

Yes, this is the lightest-weight isolation that's somewhat secure for random code. And considering that we regularly find exploits that escape that sandbox, I would never use it to run some random person's code on a server. Yes, we do that all the time in the browser, but that's a more limited environment where a single user is choosing what to run... whereas on a server, anyone who wants to attack the service can just upload whatever code they want.

I know Cloudflare is doing it in Cloudflare Workers... and I think that's crazy. They're inspecting the code, and they added OS sandboxing... so that's a little better... but trusting the JS sandbox in that environment is just nuts. It wouldn't surprise me if this ended up being another Cloudbleed.

If you want to run random code, the only safe option right now is to use a heavier form of isolation like a VM (and even that is not perfect). The performance loss is the necessary trade off to allow random people to run code (some of whom are deliberately attacking the service). This is the downside to the "cloud"... performance losses to allow random people to share servers.


An idea I've been toying around with but don't have time to fully explore or implement:

The unit of isolation should be the library, not the process. Each time you call into another developer's library, the library should have only the privileges you explicitly pass to it as capabilities, and you should be able to specify whether these capabilities may be delegated outside the receiving library and whether they can outlive the method call. So every time you call outside your own code assembly, you know that that call can't access the network, can't write to disk, can't display anything to screen. If you need it to, you explicitly pass the connection to a particular host or a handle to a particular directory, and then it can't access anything outside of that.

Not sure if this is feasible performance-wise: if you do it at the binary level, it seems like it'd require remapping page tables and flushing the TLB with every external function call, which'd absolutely kill frameworks that have to execute callbacks many times a second. But there's a certain intuitive sense to making the unit of trust & communication the code written by a certain organization, and it also fixes attacks like npm malicious modules that haven't gotten all that much attention nor have effective mitigations beyond "audit all your dependencies".
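A toy sketch of the idea in plain Python. Python can't actually enforce this (a library could still import os behind your back); the point is only the shape of the API being proposed. `DirCapability` and `third_party_report` are hypothetical names.

```python
# Toy sketch, not enforcement: a library gets only the capabilities the
# caller explicitly hands it, instead of ambient authority over the system.
import os
import tempfile

class DirCapability:
    """Grants read access to files directly under one directory, nothing else."""
    def __init__(self, root):
        self._root = root

    def read(self, name):
        # Refuse anything that would escape the delegated directory.
        if os.sep in name or name.startswith("."):
            raise PermissionError("capability does not cover " + name)
        with open(os.path.join(self._root, name)) as f:
            return f.read()

def third_party_report(data_dir):
    # This stand-in "library" can only touch files delegated via data_dir;
    # it was never handed a network or filesystem-wide capability.
    return len(data_dir.read("input.txt"))

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "input.txt"), "w") as f:
    f.write("hello")

cap = DirCapability(tmp)
print(third_party_report(cap))  # 5
```

In a real system the check would live in the compiler, kernel, or hardware rather than in a Python class, which is where the TLB-flushing performance question above comes in.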


Given that some of the most commonly used external libraries do networking, IO, graphics and encryption, I don't think this would really be feasible.


Presumably they'd have to be rewritten, but if you're talking about a new OS anyway, everything's got to be rewritten, because you won't have the standard UNIX syscalls. This is not something to embark on lightly, which is why I haven't embarked on it yet.

But imagine a future where using a traditional OS means there's close to a 100% chance that you will have your identity and/or financial assets stolen, or where these computers are widely used in the military and a security failure literally means you're dead. We're headed there anyway, judging from all the headlines on data breaches and identity theft. There'll be system failure at that point, as well as a strong incentive to rewrite everything as a secure system.


It's feasible in the same way bounds checking is feasible. Everything could have a capability token attached to it and checked. The compiler could generate all the language-level checks, and the kernel could do the OS-level checks. The main program could have some built-in capabilities plus capabilities it gets from other OS processes; it could then pass them, or derive new ones, to library functions, which could pass them on to other library functions, and so on.


But that means you have to upload raw source code. It's closer to Apple's app store deployment than to what current cloud hosting offers (arbitrary compute power for anything but crypto mining).


Not necessarily raw source code; it could be code compiled into an intermediate representation, but it has to have capability-based security deeply integrated into it, like bounds checking is in programming languages.


> But that means you have to upload raw source code.

And what is wrong with that? JavaScript source-code-only is by orders of magnitude the most popular and successful software distribution method.


In a business environment the code you run is typically the code you wrote.

You generally don't run unfamiliar code; if you do, you are not being careful. All it takes is for a container to contain a backdoor that phones home, and it essentially gives someone access to the inside of your network.


This gets silly, though. If you are going for full isolation where applications can't just let the user have data, then you have to coordinate every app talking to every other app.

This is why the share button on a phone is ridiculous. Instead of just copying to the clipboard, you get a ton of options.

Is it safer? Hard to say. I can do less and have fewer capabilities without more active work from the developers.

There is something to be said for this, of course. But it seems a losing game. I contend there is no technical panacea to security versus convenience.


> This is why the share button on a phone is ridiculous. Instead of just copying to the clipboard, you get a ton of options.

Off topic, but I don't understand your objection...

On Android Chrome the share button shows you a context sensitive choice of apps that will accept the object (a link, some text, etc), followed by a second choice of what the app can do with the object.

The flow for copy-then-paste is more convoluted and is waaay less discoverable. Copy then paste is still available for the situations where it has an advantage.


Copy-then-paste is only more convoluted if you haven't learned how computers have worked for a long time.

And things have gotten somewhat better for the phone option. But I don't have, or want, all apps installed. So my sharing is still dominated by copy/paste.


There is no technical panacea, but containers are a useful abstraction. Containers bring back the original intent of the Unix process/user security model, allowing you to scope each process to have its own 'private' filesystem to get around the broken way executables are built, packaged, and linked.

This doesn't save you if you do dumb things, like have open ports that allow administrative access that are exposed to the rest of your (potentially malicious) system. They do force you to think about some of the questions, though, like "Where is my data stored and how do other tools interact with it?"

Most of the problems that people complain containers don't solve are ones that are open ended, and have no definite solution.


Containers are a great tool for corporate servers. For end-user machines, they are a tougher sell.


What do you think of the permissions associated with Android apps? Each developer gets a separate and cordoned user account on your phone. They only have access to private storage unless they request it through special APIs.

Android apps are run in containers, for some broad sense of the word. That's how it should be. Every app I run should be run with a simply configured set of permissions, and "per app" or "as user" makes sense, even in the desktop space. That's what snap originally tried to do (but failed at, and has now broken).


Isn't that where I said it can get silly? I don't want developers having an account on my phone. I just want to process some data on my phone.

Consider that on my phone, I am limited to accessing my data for most apps through the apps. On my computer, I typically know where the data is stored.


What you're seeing is the endless see-sawing between "performance needs to be higher" and "isolation needs to be stricter".

Each alternative lands on a different balance at different times. Partly because of their own innate qualities, partly because of path dependency, partly because a "new" approach can sometimes reveal previously-unnoticed assumptions and partly because the ratios of CPU, RAM, disk and network performance are ever-shifting and thus favouring this or that architecture in the moment.

What tends to happen is that performance gains draw the first wave to a "new" technology. Then as the bulk of the industry begins to move, isolation, robustness and backwards compatibility become relatively more important. These impose overheads that will eventually be swept away by the next "new" technology.

There are still qualitative differences, though. Where you draw the firm lines in architecture affects the range of possibilities. I can place tenancy boundaries "anywhere", but in practice a VM hypervisor will be able to provide a firmer boundary than asking people not to share passwords.


What I’ve always failed to understand is how FreeBSD jails[0] never got very popular (discounting the fact that FreeBSD isn’t very popular on the whole, from what I can tell) but Docker is huge. I personally think jails are superior in implementation, in that they require no other abstractions on top of the OS.

The only thing I can surmise is that Docker might have a better secure default, but improvements to jails could emulate that, as should following best practices. Beyond that, I’m still baffled that this didn’t take off in the mainstream.

[0] https://www.freebsd.org/doc/handbook/jails.html


This happened because Docker, in addition to an isolation system, also bundled a user-friendly interface to a per-app persistent filesystem. No matter how many people sing the praises of isolation and security to Docker, I will continue to suspect that almost all of its adopters use it because packaging software with dependencies is hard, poorly understood, terribly tooled (looking at you, Python), and even more poorly executed in the vast majority of projects and companies.

Docker gives you a very simple way to not think about that (at least until your massive container that bundles ancient versions of a dozen different CVE-filled libraries bites you in the ass down the road).

It's not novel. It's not even particularly elegant. But Docker users don't want elegance; most of them just don't want to think about how to configure a production environment to work like their development environment, so we get the old joke: "It works on my machine!" "Then we'll ship your machine"...and so we got docker.


> No matter how many people sing the praises of isolation and security to Docker, I will continue to suspect that almost all of its adopters use it because packaging software with dependencies is hard, poorly understood, terribly tooled (looking at you, Python), and even more poorly executed in the vast majority of projects and companies.

Here is a quote from Eberhard Wolff's _A Practical Guide to Continuous Delivery_:

"It is laborious to install a real application including all components. Of course, it is even more laborious to automate this process... When an installation crashes, it has to be restarted. In such a scenario the system is in a state where some parts have already been installed. Just to start the installation again can create problems. The script usually expects to find the system in a state without any installed software... This problem is also the reason why updates to a new software version are problematic. In such a case there is already an old version of the software installed on the system that has to be updated. This means that files might already be present and have to be overwritten. The script has to implement the necessary logic for this. In addition, superfluous elements that are not required anymore for the new version have to be deleted. If this does not happen, problems can arise with the new installation because old values might still be read out and used. However, it is very laborious to cover and automate all update paths that occur in practice."

If a self-professed expert that writes books on the subject does not realize that package managers exist, and proposes that the only alternatives for software installation are either hand-hacked shell scripts or Docker, what do you think the average developer knows?


To be fair to Wolff, it's not uncommon for deb and rpm packages of complex daemons (e.g. postgres) to include shell scripts - postinst, preinst, etc - that make changes to the system, and which do have to be coded in such a way as to handle re-execution after partial failure. Taking my /var/cache/apt/ directory as a (poor) sample, about 1/3 of packages had such scripts.


That libc update spawned a wealth of dependency solutions. And 'DLL hell', of course.


In my experience even the best of package managers are fundamentally ill suited for deployments because of assumptions about use cases.


Can you elaborate?


Yes, this is exactly right. Dependency management is a hard problem, and with Docker you only have to get it right once. I think Docker wouldn't exist if Amazon had shipped a way to build and test an AMI locally.


There are already solutions such as Guix and Nix, but people aren't willing to learn a new language that allows expressing dependencies.

Docker solves dependencies the same way a disk image does: you save the image, and it should look the same each time you look at it. The Dockerfile is not reproducible though; you just list steps that are run in order to generate an image, with no guarantee of producing the same result. I've seen, multiple times, the scenario where a Docker image built fine for one person and didn't for another.

The main reason for this is that the majority of Dockerfiles rely on the network to build the image, and the files received (or even your apt-get command) might produce different results for different people.

Recently I saw an example where a Docker image built on the build system didn't even work on the deployment system. Both machines of course were x86_64. It turned out that one of the dependencies enabled compile optimizations for the CPU it was built on. You could argue that this isn't Docker's fault, but isn't its promise to provide reproducible builds?
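The non-determinism usually comes from steps like these (a hypothetical Dockerfile; `libfoo` is a stand-in package name):

```dockerfile
# Hypothetical example of a build that depends on when/where it runs.
FROM ubuntu:latest              # "latest" points at a different image over time
RUN apt-get update && \
    apt-get install -y libfoo   # resolves to whatever version the mirror serves today
COPY . /app                     # build context differs from machine to machine
```

The image you get out of this is frozen once built, but building it twice on two machines can give two different images.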


Agreed. Nix and Guix actually solve the dependency problem, Docker allows you to work around it. But a work around is better than nothing.


Well, but my point is that there isn't nothing; there are alternatives, which can (or at least Nix can) generate Docker or OCI containers.

Docker is trying to sell itself as something more than it is, in reality it is just an overglorified zip file.


> I will continue to suspect that almost all of its adopters use it because packaging software with dependencies is hard, poorly understood, terribly tooled (looking at you, Python), and even more poorly executed in the vast majority of projects and companies.

Static linking solves a lot of these deployment issues. But I guess people are worried about duplicating libraries and security vulnerabilities. So we end up with Docker which has these problems, but even worse.


Can't easily statically link a big complex application written in a dynamic language.


That's a reason to avoid dynalangs. Otoh, Java and .NET have jars and assemblies.


Depends on the language, doesn't it? Many Scheme implementations have "unexec".


Yep, never underestimate a good interface. I can do most of what ZFS does with dm-crypt, LVM, etc., but that's a handful of interfaces to learn; ZFS includes it all. (Probably a bad analogy, but it gets the idea across.)


But they are popular...

Jails and chroots are used heavily in iOS and Android. I would say it's one of the fundamental differences between mobile apps and desktop apps.

It's not enough though. You need Docker to handle the DNS, routing, port forwarding, etc. if you want to emulate a pocket cloud on a dev machine. I suppose the networking aspect could be done for jails as well, but docker and docker-compose just seem more polished in this regard.


> You need docker to handle the DNS, routing, port forwarding etc.

If you do IPv6, there is no need for all of these layers of IPv4 overlay network crap.


IPv6 saves you from setting up DNS on each instance and will do load balancing?


Depends on what you use DNS for. Because the /64 space is so large, and you have a very large, very flexible subnet above that, you can set up an address assignment scheme for non-public facing nodes where you do not need DNS, and you can integrate that scheme with your load balancer. See Coffeen's _IPv6 Address Planning_ for ideas.
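A minimal sketch of what such a scheme might look like, using Python's ipaddress module and the 2001:db8::/32 documentation prefix; the particular /64 and the service/instance field layout are entirely made up:

```python
# Hypothetical addressing scheme: encode a service ID and an instance ID
# into the host bits of a /64, so addresses are predictable without DNS.
import ipaddress

# 2001:db8::/32 is reserved for documentation; the /64 below is made up.
PREFIX = ipaddress.IPv6Network("2001:db8:42:1::/64")

def node_address(service_id, instance_id):
    # Low 32 host bits: 16 bits of service ID, then 16 bits of instance ID.
    host = (service_id << 16) | instance_id
    return PREFIX[host]

# A load balancer can compute its backends' addresses the same way,
# with no lookup step.
print(node_address(0x0001, 0x0003))  # 2001:db8:42:1::1:3
```

The enormous host space of a /64 is what makes this workable: you can burn bits on structure without ever worrying about running out of addresses.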


That seems like a lot of work, and you're ultimately not going to be able to use this in a full deployment. Docker models and maps to cloud infrastructure quite nicely, and that's a reason it's popular.


I interpreted the above comment to be about general purpose computers, not mobile.


Docker is convenient. You can just download the docker binary and type "docker run postgres" and have a container running Postgres. What's the equivalent for FreeBSD jails?


There is no equivalent, of course. I guess people are confused about what jails are. Jails, Solaris Zones, LXC, runC, and chroot with namespaces and cgroups are all similar technologies that are not solving the same problems as Docker. Docker just uses containers underneath to address specific package management problems. Projects that try to do the same are actually package managers, like Nix and Guix.


Also, I think many people believe it is very complicated to set up a database. I just ran one yesterday on OS X; the commands issued were:

    nix-env -iA nixpkgs.postgresql_11
    initdb ~/data
    pg_ctl -D ~/data start
    createdb
The first command installs the database, the second populates the database files, the third starts the database, and the last command is optional: it creates a database for your user name.


Docker is a turn key solution to three different problems (build, package, run). A lot of people know of alternatives that cover only the run part and then they wonder why their particular solution didn't become as popular.


Chef habitat


Because Docker has better dev marketing. Some sizable portion of developers actually believe Docker invented containers.


This, pretty much. Nearly everything written about Docker basically treats it as synonymous with containers. This is also incredibly annoying if you're already familiar with the general concepts of containers and actually want to read something about Docker in particular.


Jails are more like runC, the container tech behind Docker.


I've tried a few times to get jails working but always stumbled over something. Last time it was inbound connections for a server. After several hours I gave up. There's iocage now, but this was well before that. And iocage still has issues.

Docker I could get up and running in less than an hour, and it just worked.


I'm smiling reading your comment, having fought with jails/iocage and networking in the past. The counterpoint is that you likely didn't have inbound connection problems with Docker because it seems to quite effectively bypass all of the iptables rules on the host.

I didn't dig too deep into it, but I had iptables rules set up to only allow inbound 22 and 443, but all of the exposed ports on the containers were still externally available.


I was just going to post about jails but you beat me to it.


better marketing I would assume.


I am still waiting for all the Docker fans to acknowledge bhyve [1]

[1] - https://bhyve.org


Good point. The current widely used kernels seem to make some assumptions about ‘owning’ the hardware and implicitly trust any running programs at a certain level.

There is a need, however, for some pragmatism.

While a capabilities-based OS could be the right solution, it doesn’t exist today.

What we have is code built for some flavor of POSIX. People demand cost efficiency today.

So, engineers do what’s practical: build a multi-tenant system running code that expects to be the owner of the hardware.

It works well enough! For most people, the cost efficiency math works out. The sandboxing overhead is worth it, as it pales in comparison to the headcount required to manage your machines, or even VMs, if you replace them with serverless products.


From memory, I seem to recall that Genode takes this to its logical conclusion: every process is isolated with virtualization primitives.

As to why: isolating processes the old way needs jails to work properly, BSD lost the popularity contest, and Linux jails didn't get secure enough before VMs and containers took off.


Did the BSD jails provide enough isolation? Would memory from a process truly be isolated from another? How about FDs?

Could there be side channel attacks? At what level?

Perhaps the whole kernel design needs to be revisited, which I assume is what’s being taken on by Fuchsia.


I’ve seen no evidence that Fuchsia is bringing much more than 90s era embedded OS design. Just because it’s not Linux doesn’t mean it’s new.

Not that this is a bad thing, but someone please correct me if there is something fundamentally recent in Zircon.


The BSD jails / Solaris Zones approach is not the same as the path taken by Linux. Linux gives you a facility to isolate the network, a facility to isolate process views, etc. You put them all together and you get a "container". The former starts off with a container primitive that can't do much because, well, it's contained from everything. You then proceed to give it access to the network, the filesystem.
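You can see the Linux facilities directly (assuming a Linux host; each entry under /proc/self/ns is a separate namespace the process belongs to, and a "container" is just a process placed into new instances of several of them):

```shell
# List the namespace types attached to this very shell:
# cgroup, ipc, mnt, net, pid, user, uts, ...
ls /proc/self/ns/

# Each is a distinct kernel object; two processes sharing the same
# symlink target share that namespace.
readlink /proc/self/ns/net
```

Tools like `unshare(1)` let you move a process into new instances of any subset of these, which is the composability the parent describes.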


What do you mean? Isolating the network and isolating process views is what FreeBSD jails have always done. The most common use of FreeBSD jails was providing VPS servers to users.


No argument there. My point is that the default for FreeBSD jails is "isolate everything" and it's up to the user to open it up.

My post was in response to 'sayhello' that is wondering whether jails provide enough isolation.


I see, I misunderstood what you meant.


Threads are not the same because they are separate computation in the same address space.

At some point though, permissions to disk, network, screen, input devices, etc. need to be an integrated part of computing for anything that runs more than one program.

Take a look at QNX though, at some point it will be reinvented and heralded as a breakthrough, 40 years after it happened the first time.


Processes running on the same host are a "separate computation in the same address space" - They share a filesystem namespace.

Pods running on the same kubernetes cluster are a "separate computation in the same address space" - They share a DNS namespace.

Servers running on the same internet are a "separate computation in the same address space" - They share an IP namespace.

These are the same abstraction at multiple levels, and that's okay, because you often do need that. The L1 and L2 caches are to a single machine what in-memory and Redis caches are to distributed systems. Components with the same shape show up at multiple levels. This is okay. What's important is that those levels each serve a purpose, and that you reduce the abstraction to be as simple and meaningful as it can be.


Processes exist in separate virtual memory spaces. Threads in the same process exist in the same virtual memory space. This is a fundamental principle of virtual memory.

Mixing this up with a filesystem or any shared resource is completely nonsensical. Naive oversimplifications like this indicate a lack of understanding of how and why computers are architected as they are in the first place.

Virtualization could be said to be an extrapolation of virtual memory and in fact does offer sub virtual memory, but to equate threads, processes, virtualization and shared resources as the same thing is grasping for simplicity that isn't there by making comparisons that aren't true.


A good analogy is network layers. The problems and solutions handled at the network layer resemble those at the internetwork layer, which resemble those at the transport layer, in turn similar to those at the application layer.

Convergent evolution doesn't mean we should all become the same organism.


That's a good point. But as I think about it in a sense threads are also a form of isolation in that they allow you to isolate units of business code so that they can run independently and be (potentially) protected from other business code monopolizing a shared resource while not actually doing work.


Threads are not a form of isolation, they have nothing to do with ' business code ' and they offer no protection at all. I'm not sure what construct you are talking about, but it is not threads.


I also have this sneaking suspicion that new generations are re-inventing the wheel in a lot of ways.

Suspicion fades away, and the older you get, the more often you see this happen.


I think you could build a secure application environment with just bare metal.

The only hip tool I would use is Ansible after paying my respects to cfengine.

No virtualisation No containers No overlay networks

A few VLANs perhaps. If I may.

It can be done.

Just a few dedicated people. The kind of people Colonel Kurtz talked about...


I'm working on something like this with the DebOps[1] project. It's a set of Ansible playbooks and roles that let you manage Debian environments. I aim for it to be agnostic - roles can be used effectively in LXC containers, virtual machines, or on bare metal.

You can start with a few machines with Debian netinst + SSH installed, and build them up with the services you want. Applications are installed either from Debian or upstream repositories (via APT), downloaded directly from upstream with proper signature verification, or compiled from git sources. There are no OS images, therefore no containers or VMs are strictly necessary.

It seems that in the age of Kubernetes, projects like this are a niche, but if somebody is working with on-premises infrastructure and bare metal hosts, a good set of tested Ansible roles which you can use as a base for your own deployment might come in handy. I haven't found anything close to what I had in mind back then, so I started my own, and have been going at it for 6 years now. Hopefully it will be useful to somebody else at some point. :-)

[1]: https://github.com/debops/debops/


Thank you for sharing this!


Is Ansible still hip or has it gone the way of MongoDB?


Ansible is still a practical tool for some configuration management tasks, as opposed to Mongo, which I have a hard time finding a niche for.


Did I write this comment? I've had the same feeling. I work on securing VMs in a hypervisor and part of me wants to look at ring 0 to 3 instead of ring -1 to 0 and see if I can fix that instead of bloating cpu specs with more rings.


“function” for suitably principled definition of function


And let me state again for the record that all of these promises being made by container systems sound an awful lot like the promises I was offered by 'real' operating systems in the early nineties.

I think the only real difference is that there has been a sea change in public opinion on this kind of aggressive isolation by default being worthwhile.

But a hypervisor publishing a bunch of services that talk to the world and each other? Things are beginning to look a bit more like microkernels as time goes on.


Plus now the app devs, all turned devops because talent is scarce and demand is high, get to handle the complexity of concurrency, networking and reliability. This adds to deployment/testing/development setup, and when you change stack or job, the architecture is slightly different and you need to learn it all over again.

Funnily enough, we couldn't make the microkernel concept work despite having a homogeneous, integrated environment as a basis. But we are fine building Frankenstein OS images, hacking multiple graph-shaped systems to make them play together, and calling that a success.

Now don't get me wrong, I understand perfectly the benefits of it. But somehow, it feels like we arrived at the right result by the worst possible path, and at a price way higher than it should be.


What if the pure microkernel approach is flawed and hypervisors offer a sensible middle ground?


Most of the people that rail against pure microkernels have zero experience actually building stuff with them. Those that do have that experience have hard proof that systems built in this way are more reliable and easier to build than others. Hypervisors have their own place in the hierarchy, they do not compete effectively with microkernels on their own turf.


The typical problem with microkernels is poor backwards compatibility, mainly due to POSIX. If you add POSIX compatibility, you don't gain any isolation, performance or security benefits of microkernels, and changing your thinking and your programs to gain those benefits requires significant work when they're written against the POSIX API.

Building them that way from scratch is pretty much just as easy as writing programs in UNIX, modulo the lack of tooling that is mostly POSIX-based.


If you don't mind this tangent question...

Is there a microkernel API that has more acceptance nowadays? In other words, where should anyone wanting to work on real world / enterprise systems based on microkernel look?


I don't know if that's the idea, but Genode works over a few microkernels.


(original Genode dev here)

You nailed it: from day 2 or so we envisioned developing Genode as the future common API for the diverse range of microkernels and showcased the approach with all the L4 kernels we could get our hands on. The vision is still to provide a replacement for POSIX in the microkernel world ;-)

Regarding POSIX, we don't think that tools and applications developed for this interface fail to fit the microkernel world, but at the traditional systems level POSIX (and its tools) is not sufficient to make the best out of the microkernel advantages. For example, we run libcurl, GPG, libarchive, and coreutils in an orchestrated scenario to implement a robust and secure deployment/installation subsystem in the Genode-based Sculpt OS: https://genode.org/documentation/articles/sculpt-19-07


seL4: https://sel4.systems/

The kernel is even formally verified, and it's very widely deployed.


Define widely.

The OKL4 kernel in iPhones is extremely limited in its functionality; you could probably run that code even without an OS. It could also be argued whether in its current form it is really a microkernel. Furthermore, seL4 and OKL4 are not the same product.

(nothing bad about seL4; I have used it myself and I personally know the people behind the project. But it's not what you think it is)


Not what I think it is? It's a capability secure microkernel, that's literally all I've ever claimed. I'm not sure what you think I've been saying.


seL4 looks quite interesting. What use cases have you used it in?


Isn't Windows NT the poster child for a non-POSIX mainstream microkernel?


The NT kernel is not really a microkernel. It took inspiration from microkernels, but it's still a highly modular monolithic kernel.


Would you consider the GNU Hurd folks experienced enough?


No. I thought GNU Hurd was a great idea executed poorly. QNX is that same idea executed in an excellent and pragmatic way.


But QNX is almost universally hated by people actually working with it :)


I've developed for it and I thought it was pretty nice. Sure, the kill command is called slay, and other oddities, but at its core it is a very well-designed system.


> Things are beginning to look a bit more like microkernels as time goes on.

Hypervisors typically are microkernels. The very first "hypervisor" was L4 in fact, with L4Linux.

Also, you're right that virtualization and containers are fulfilling the promises of operating systems. The problem of earlier OSes is that they didn't take security seriously enough. Modern hypervisors aren't much better at isolation than real microkernels like L4, KeyKOS and EROS. They're better than containers though.


> The very first "hypervisor" was L4 in fact, with L4Linux.

The term "hypervisor" was coined by Popek and Goldberg in 1974. L4 didn't appear until the 90s.


> Hypervisors typically are microkernels.

Except they have vastly different histories (look up IBM's VM) and uses and underlying technologies.

Here's a good post on the difference:

https://utcc.utoronto.ca/~cks/space/blog/tech/HypervisorVsMi...

> Microkernels are intended to create a minimal set of low-level operations that would be used to build an operating system. While it's popular to slap a monolithic kernel on top of your microkernel, this is not how microkernel based OSes are supposed to be; a real microkernel OS should have lots of separate pieces that used the microkernel services to work with each other. Using a microkernel as not much more than an overgrown MMU and task switching abstraction layer for someone's monolithic kernel is a cheap hack driven by the needs of academic research, not how they are supposed to be.

> (There have been a few real microkernel OSes, such as QNX; Tanenbaum's Minix is or was one as well.)

> By contrast, hypervisors virtualize and emulate hardware at various levels of abstraction. This involves providing some of the same things that microkernels do (eg memory isolation, scheduling), but people interact with hypervisors in very different ways than they interact with microkernels. Even with 'cooperative' hypervisors, where the guest OSes must be guest-aware and make explicit calls to the hypervisor, the guests are far more independent, self-contained, and isolated than they would be in a microkernel. With typical 'hardware emulating' hypervisors this is even more extremely so because much or all of the interaction with the hypervisor is indirect, done by manipulating emulated hardware and then having the hypervisor reverse engineer your manipulations. As a consequence, something like guest to guest communication delays are likely to be several orders of magnitude worse than IPC between processes in a microkernel.


> Except they have vastly different histories (look up IBM's VM) and uses and underlying technologies.

Microkernels over time have also employed vastly different technologies. All the "differences" you note are "intended system design/interaction", which frankly isn't meaningful. For instance:

> By contrast, hypervisors virtualize and emulate hardware at various levels of abstraction. This involves providing some of the same things that microkernels do (eg memory isolation, scheduling), but people interact with hypervisors in very different ways than they interact with microkernels.

Who cares how they interact with it? A microkernel is defined by the sorts of abstractions it provides and the isolation properties those abstractions entail. Hypervisors are effectively less expressive microkernels.


>"Using a microkernel as not much more than an overgrown MMU and task switching abstraction layer for someone's monolithic kernel is a cheap hack driven by the needs of academic research, not how they are supposed to be."

I found this interesting. Can you or anyone else say what the context was where academic researchers have needed to do this? What problem was it solving for them in a cheap way?


There is a linked article on his blog https://utcc.utoronto.ca/~cks/space/blog/tech/AcademicMicrok... that expands on this. Specifically:

>the whole 'normal kernel on microkernel' idea of porting an existing OS kernel to live on top of your microkernel gives you at least the hope of creating a usable environment on your microkernel with a minimum amount of work (ie, without implementing all of a POSIX+ layer and TCP/IP networking and so on). Plus some grad student can probably get a paper out of it, which is a double win.


This is a great read. Thank you.


Isn't the important thing about KeyKOS and EROS capabilities, and L4 variants aren't all capability-based, are they?


I believe it's because at some point operating systems (particularly Linux-based ones) became semi-closed ecosystems of software, instead of a platform to run other people's software. There's also a 'singleton' mentality, where only one version or variant of an app/library is accommodated.

So now we're falling back to the few stable ABIs we have available -- Linus's kernel ABI (containers) and some kind of x86 platform virtualization. I think it's a little sad that this happened because operating systems did a poor job of doing what we needed them to do.


This is simply a side effect of ill-disciplined programmers breaking ABIs whenever they feel like it. It's quite possible to maintain ABIs in the long term and upgrade and deprecate them thoughtfully, and if you do that then you don't need multiple versions of a library or service around.


This is totally true, but I think it's the opposite approach that would have actually made a difference here. Saying "we know these ABIs won't be stable, either practically, or by design, so this is where the OS ends"

Then Linux package managers would have made it easy to eg. install local (eg. per user, or in a folder) libraries or software.

Of course it was always possible, setting various $PATH variables etc. but because it wasn't easy (and you're immediately out of the ecosystem) all focus has been on the root user and singleton approach. Static linking was put to one side too.

In doing so we made the presentation of the "operating system" a full ecosystem of libraries, too complex to be a defined layer to interface with. But the kernel realised the importance of a stable ABI layer, so now we have containers instead.


I install pip dependencies for Python into a vendor folder. It is definitely the cleanest solution compared to dicking around with virtualenv.
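A minimal sketch of that mechanism (a stand-in module plays the role of a real pip-installed package here; in practice `pip install --target=vendor <pkg>` populates the folder the same way):

```shell
# Create the vendor folder and drop a "dependency" into it.
mkdir -p vendor
printf 'VERSION = "1.0"\n' > vendor/mylib.py

# Point the interpreter at the local folder instead of a site-wide install.
PYTHONPATH=vendor python3 -c 'import mylib; print(mylib.VERSION)'   # prints 1.0
```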


This is actually a great explanation of how/why we got virtualization rather than microkernels. The hardware presented an ABI that couldn't be circumvented. The problem with software interfaces is that they're soft. With h/w we use a 'good enough' interface for possibly too long before it gets replaced, whereas in software we decide to upgrade the as-yet-not-widely-adopted interface with a better one and repeat.


If the POSIX committee couldn't deliver a solution that was modern and backwards compatible, what hope do we have that developers will self-organize and ever do that?


POSIX committee doesn't deliver anything.

POSIX is just a common API across UNIX clones.

And besides the AT&T original design, UNIXes have disagreed in almost everything else, hence UNIX wars.


Q1. What is the Portable Application Standards Committee (PASC)?

The IEEE Computer Society's Portable Application Standards Committee (PASC) is the group that has and continues to develop the POSIX family of standards. Historically, the major work has been undertaken within Project 1003 (POSIX) with the best known standard being IEEE Std 1003.1 (also known as POSIX 1003.1, colloquially termed "dot 1"). The goal of the PASC standards has been to promote application portability at the source code level.

http://www.opengroup.org/austin/papers/posix_faq.html


Examples of APIs actually proposed by POSIX that weren't already in widespread use before adoption in standard updates?


I think you're just making my point.


Not really, last serious revision was in 2008, or so.


Yeah all those stupid fools... who needs machine VMs, language VMs, processes, process namespaces... Just compile your software with pointer authentication and run it in kernelspace on bare metal. Why is everyone so stupid and I am the only smart person on this planet?

... If you haven't realized it yet. I'm not serious at all ...


"Containers" is an unfortunate term, since it really better describes the container image than an actual running process with API virtualisation.

I think VMs-as-containers is where we'll wind up. The container image has turned out to be the real thing of interest, the runtime is almost secondary. Virtual machine systems have closed the performance gap in a variety of ways.

For example: tearing out kernel checks for devices that will never be connected to the VM; taking advantage of hyper-privileged CPU instructions; being able to make and restore higher-fidelity checkpoints than an OS can for faster launches etc.

At which point the isolation benefits of VMs really begin to outweigh everything else. A hypervisor has a much smaller attack surface and has a much simpler role than a full monolithic OS kernel. It partitions the hardware and that's about it. It doesn't exist in a constant tension between kernel-as-resource-manager and kernel-as-service-provider.


This is kinda where VMware is going with Project Pacific & vSphere Integrated Containers: containers running as individual VMs on a hypervisor. I wonder if we will see others following the same pattern?

I’m not sure what the drawbacks might be though


I think yes. Red Hat and others are working on KubeVirt, plus I believe there are CRI implementations for gVisor and Firecracker.

Disclosure: I work for Pivotal, which is in the middle of acquisition by VMware.


Yes and no, there's the attack surface of the guest operating system.


I'm quite hopeful for the limited re-introduction of hypervisors to "containment", if only because I've become disappointed with the lack of power app authors have to specify fine-grained, strict security details. This is because most of the fun lockdown toys that linux has, selinux, seccomp et al aren't easily "stackable", and those toys have been used by the orchestrator to perform the containment. And that's great, but it means containers all end up with broad, generic policies (that only care about container breakout) which can't be further restricted by app/container authors.

My hope is that lightweight virtualisation will give back the ability for app authors to tighten their containers' straitjackets. Personally I've got my eye on Kata Containers.


I have a Xen hypervisor at home, running on a (well-configured) NUC. VMs boot in about 15 seconds. I'm patient!

I'm not doing devops stuff; I no longer code, so I don't need a testing pipeline.

I looked into containers based on LXC, when it was first introduced. I decided to stay away - I don't want to get tied into Poettering's code. Yeah, I'm running systemd on some of the VMs, but you don't really have much choice nowadays. I still use Debian, but I'm bitter about their adoption of systemd - and my hypervisor machine runs SysV Init, because I know how that works.

I never mixed it with Docker. From what I've been reading, Docker is already old-hat (it's only about 3 years old; how did that happen?) Kubernetes seems to be the thing nowadays. I don't even know how to pronounce "Kubernetes".

What was wrong with LXC? Like, LXC comes with the OS. Nothing to install. Why do people love these 3rd-party container engines?

And "wrappers"? what purpose is served by container wrappers?

Serious questions, I'm not trolling. Promise. I'm just a bit out-of-date.


Having used both LXC and Docker, the main advantage of Docker is using it as a package manager, with the ability to extremely easily start from a 3rd party container and layer your own changes on top. While I'm sure LXC could technically achieve the same, it is way more difficult to do so than docker's 'FROM 3rdparty-container:$VERSION' / 'docker pull 3rdparty-container:$VERSION'.
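The layering in practice is just a two-line Dockerfile (a sketch; `init.sql` is an invented file name, though the official postgres image genuinely runs .sql files from that directory on first start):

```dockerfile
# Start from a third-party image and layer your own changes on top.
FROM postgres:12
COPY init.sql /docker-entrypoint-initdb.d/
```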

On the other hand, Docker for a long time didn't really care about security, which LXC was much better at since the beginning more or less.

As for Kubernetes, that is most often used in conjunction with Docker (though there are alternatives), if you want to deploy containers to a cluster of machines (VM or physical) and get rid of some of the administrative work for doing so.


Some of us have problems that are actually well solved by these tools and run on more sophisticated hardware than our home computer... It doesn't sound like you've hit that set of problems.


Yeah, thanks. It's a home computer because I don't work any more - I set-up and operated a continuous production VM pipeline on my boss's rack, back when I had a boss.

Docker had turned up; I evaluated it (quite briefly - I had work to do, and nobody had asked me to evaluate Docker). We had fairly specific requirements that I couldn't match to Docker.

I recall now that Kubernetes is principally a deployment system - sorry. Perhaps it is well-integrated with some container systems, such as Docker. Kubernetes didn't exist when I had a boss. I use Ansible.

I get that LXC is not very friendly. But it works OK, doesn't it? Just wrap a shiny skin around it. Don't re-invent the wheel; we already have enough wheel inventions.

I'm not by any means knowledgeable about containers; I evaluated LXC way back when, and decided that VMs were more secure. And never revisited that decision.


It is popular because of the developer experience. You can get a Postgres database running with one command, and then throw it away (and no this is not equivalent to apt install postgres). Many other things are pretty polished as well, definitely more polished than LXC.
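A hedged sketch of that workflow (container name and password invented for the example):

```shell
# One command up; --rm means the container and its data vanish on stop.
docker run --rm --name throwaway-pg \
  -e POSTGRES_PASSWORD=secret -p 5432:5432 -d postgres:12

# ...point your app or psql at localhost:5432, then throw it away:
docker stop throwaway-pg
```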

Kubernetes and similar software "solve" some problems and create a bunch of new ones.

I don't think the actual tech is that good (or new for that matter, Solaris had zones in ~2005); for example getting a core dump from a container without going into the host machine is an unsolved problem.


Docker achieved a number of things: they provided a simple user experience and a simple, relatively effective image format. The primitives had been in the kernel for some time and were widely used to implement a variety of systems (Cloud Foundry used these kernel features over 2 generations of its own code before switching to runc).

If you're just running stuff on a single machine for your own satisfaction, you don't need to fret about Kubernetes.

Disclosure: I work for Pivotal, so I've been up close with some of this stuff for a bit.


"you don't need to fret about Kubernetes"

Don't know why you got voted down.

I'm retired, but I want to keep my hand in. I like to be able to throw up a server quickly. I like to be able to automate testing and deployment - Ansible is my friend. But I don't have to throw up and configure 100 servers by 10:00am :-)


> The Docker engine default seccomp profile blocks 44 system calls today, leaving containers running in this default Docker engine configuration with just around 300 syscalls available.

...preventing devs/ops people from running tools like iotop, unless extra capabilities are added.

I'm all in for containers and cgroups/namespaces, but at the moment it's namespace isolation at the price of fewer features. Unless namespaces become first-class citizens in the Linux kernel, it will always be more efficient to just run on VMs or even bare metal. At least for non-planet-scale workloads. :-)
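For instance, a debugging container might be loosened like this (the image name is invented; iotop relies on the kernel's taskstats netlink interface, which needs CAP_NET_ADMIN):

```shell
# Sketch: add a capability and drop the default seccomp profile
# for an interactive debugging session only.
docker run --rm -it --cap-add NET_ADMIN \
  --security-opt seccomp=unconfined debug-image iotop
```

The trade-off is that each of these flags widens the container's attack surface, which is exactly the tension the parent describes.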


> ...preventing devs/ops people to run tools like

This is because docker makes the fundamental mistake of conflating packaging with isolation. "Packaging" is achieved in docker by using the OS to do an amount of sandboxing and then letting the user perform whatever non-reproducible crap they like before balling the whole thing up and calling it a package.

If instead you make an app author actually figure out what their dependencies are and how to fetch/build them - in a system such as Nix, you not only get reproducible packages, you also get to decide to apply actual os-level isolation on a case by case basis - a developer doesn't necessarily need/want these barriers on their dev machine.


So a system like Android's? Honest question; I don't know if that's a good model to exist in general-purpose Linux systems. There's also Fuchsia, but I'm not sure if it's POSIX.


> So a system like Android's?

Not really, I'm essentially describing nix/guix.


This is wrong on a fundamental level. Containers are nothing more than regular processes that are launched leveraging some of the kernel’s built-in namespacing features.

When we talk about applying seccomp profiles to containers it just means applying them to the process — exactly how the rest of your system uses them. They are about limiting what the process itself can do, not you as an admin. Denying the ability for sshd to run iotop with SELinux doesn’t stop you from running it.

Running containers on a bare-metal host is exactly the same as running processes on a bare-metal host.


Namespaces _are_ first class citizens in the kernel.

Have you tried any solutions that are not based on docker?


> ...preventing devs/ops people to run tools like iotop, unless extra capabilities are added.

you run those on the host, not inside containers


So you have to give people access to the entire host! Great. And you can't even do that in cloud environments.

And even then you run top on the host and obviously it doesn't know anything about containers because they're not real.
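The flip side of "they're not real" is that the host can still recover the mapping, since container membership is just a cgroup label on the process. A rough illustrative sketch (Linux only, helper name is hypothetical):

```python
# Sketch (Linux only): from the host's point of view a container is a set
# of processes placed in a cgroup, so a host-side tool *can* map a process
# back to its container by reading /proc/<pid>/cgroup.
import os

def cgroup_of(pid):
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            # cgroup v2: a single line like "0::/some/path";
            # cgroup v1: one line per controller hierarchy.
            return f.read().strip().splitlines()
    except OSError:
        return []  # process exited, or file unreadable

if __name__ == "__main__":
    # Print the cgroup membership of the first few processes on the host.
    for pid in sorted(int(p) for p in os.listdir("/proc") if p.isdigit())[:10]:
        print(pid, cgroup_of(pid))
```

Tools like `systemd-cgls` do essentially this; plain `top` just doesn't bother to group by it.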


I see container systems, such as Docker, as a packaging system more than anything else.


A Turing complete, language agnostic one.

If you were trying to write one that wasn’t container focused, how would that look?


https://nixos.org/nix/

https://guix.gnu.org/

Containers are a very poor substitute for package managers.


It would look like NixOS.


Just waiting for the new re-discovery of hypervisors, but with better marketing names.


"Workload Orchestrators"

K8s can already do VMs with Kube-virt, so yeah.


K8S is unnecessarily complicated. I fully expect "serverless", warts and all, to take all comers. And, I get the irony. It's basically cgi-bin 2.0. It will win not because it is better, but because it is better "understood".


The irony is that now we have Azure services to keep state across serverless requests.

Excuse me while I go implement a servlet over there....


Serverless has been around now for several years and it hasn't taken off yet... definitely not to the same degree containers have.

I'm skeptical it will. To adopt serverless you need to be willing to rearchitect your product and retool your developers... that's expensive.


>(Serverless) I'm skeptical it will.

They'd get more uptake if it were easier, I think. Not by re-architecting your product, but by small things here and there.

I wanted to play with Azure Python functions, but despite VS Enterprise and lots of credits I can't. Without admin rights on the local machine it's basically impossible. (You need VS Code and the Azure toolkit.)

(Unrelated - that kinda blew my mind - no you can't do that in the 2,000 USD VS enterprise...you need to use the free one)


> (Unrelated - that kinda blew my mind - no you can't do that in the 2,000 USD VS enterprise...you need to use the free one)

I’m guessing this is because Microsoft wants it to be more accessible—they probably realize that there isn’t much money to be made in $2,000 developer tools. Visual Studio was never a Python IDE; VS Code is much more language-agnostic.


Sure, but the whole "this feature is impossible in our 2,000 USD suite but it's possible in our free one" thing doesn't seem strange to you?

If I'm buying the top-end product I'm expecting the full feature set, no?


Containers are still relatively new and definitely not as ubiquitous as virtual machines.

Virtual machines are everywhere, from SME to large enterprises, tech hubs, industrial software et al.

Containers are mainly used for modern web and app development.


The rabbit hole goes deeper than that, my friend. https://cloud.google.com/run/


Oh God, why? Btw, this uses gVisor (https://gvisor.dev/) as the container runtime (instead of runc/containerd). That seriously limits what syscalls the containers can leverage. And the pricing is complicated too.


CGI-bin 2.0 is actually quite neat: having your script kick off on as many machines as needed is something CGI-bin 1.0 never cracked on any significant scale, and (eventually) having something like a standard interface to do that on Someone Else's Datacenter will be fun.


> And, I get the irony. It's basically cgi-bin 2.0.

More like "fancy inetd."


Serverless servers?


Bare-metal containers.

No, that was even worse.


Federated subsystem authoritator


The future is probably WASI: a sandboxed compile target with a capabilities based security model not owned by a single corporation. https://wasi.dev


Is there still any place in 2019 for "system containers" like LXC/LXD?


They are quite useful as "VMs at the speed of application containers".

When you use application containers, you tend to create a more complicated setup and lose visibility as to what's going on in the system.

Just like you use VMs on a baremetal server, you use system containers in a VM.

But I find more important to use LXD as a tool for software development and also for Linux desktop use.

If you want to set up Node.js or something similar, it makes sense to put it in an LXD container so that it does not mess up your host. Each step of the workflow (creation, deletion, etc.) takes a few seconds or less.

You can also set up LXD containers to run GUI applications. It makes a lot of sense if you want to run games (like Steam) so that your desktop does not get polluted with i386 packages. You can also use it for development tools, such as those from JetBrains, Android Studio and more.


They may not be mainstream but I use them as isolated VM-like application environments, where everything “just works” without having to learn/apply a whole lot of new tools or workflows.

I’m sure there must be others who see the benefit of this approach too?


They're certainly very useful for CI systems.


Key point: AWS Firecracker does NOT run on AWS.

Unless you want to pay for bare metal instances.

AWS Firecracker DOES run on Google Cloud, Azure and Digital Ocean.


Firecracker is used as the runtime for Lambda, I believe.


For the record, hypervisors were introduced almost before operating systems. In the beginning it was just a switch to partition memory and run multiple jobs on a big, expensive machine.

So this is not the first time hypervisors are making a comeback.


Wait... this isn't new: hypervisor-based container isolation has been in Windows Server since 2016 (it's called Hyper-V containers).


Did the Windows Server isolation bundle a slim kernel?

One of the core elements of this for the Serverless fast boot requirement is using a kernel that has legacy modules excised.


So I read something succinct a while back: Docker et al. are distribution platforms, not security platforms. They add "0" security against an adversary (think padlocks). How true is that? And what high-performance, securely contained systems are there? OpenVZ?


Sandstorm has had a pretty good track record [1]. Then again, it was designed and run by capability security folks, who take security pretty seriously.

[1] https://sandstorm.io/


Unfortunately it is shutting down.


As a hosted service, yes. You can still self-host though.


I think the more correct statement is that “containers” (i.e processes running with some kernel namespacing features) don’t replace the need for existing security tools like seccomp, SELinux, apparmor. Namespacing is just another tool in your arsenal to help create some logical separation between security domains but you still likely need more to be sure that the separation is enforced.

“Docker”, the tool that is slowly acquiring the ability to natively use all of these security tools and apply them to the containers it launches is/will be a security platform.
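The "seccomp applies to processes, not to some container-specific thing" point (also made upthread) can be demonstrated directly. Below is a minimal sketch using seccomp's strict mode, the simplest variant, via prctl(2); real container runtimes install BPF filter-mode policies instead, but the enforcement target is the same: the process.

```python
# Sketch (Linux only): seccomp restricts what a *process* may do, container
# or not. Strict mode permits only read/write/_exit/sigreturn; any other
# syscall kills the process with SIGKILL.
import ctypes
import os

PR_SET_SECCOMP = 22       # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1   # from <linux/seccomp.h>

def run_confined():
    """Fork a child, confine it with strict-mode seccomp, then have it
    attempt an open(2), which strict mode forbids."""
    pid = os.fork()
    if pid == 0:
        libc = ctypes.CDLL(None, use_errno=True)
        if libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0:
            os._exit(2)   # seccomp unavailable in this environment
        os.open("/etc/hostname", os.O_RDONLY)  # forbidden under strict mode
        os._exit(0)       # never reached if seccomp is enforced
    _, status = os.waitpid(pid, 0)
    return status

if __name__ == "__main__":
    status = run_confined()
    if os.WIFSIGNALED(status):
        print("child killed by signal", os.WTERMSIG(status))
    else:
        print("child exited with", os.WEXITSTATUS(status))
```

Nothing here involves a container runtime at all, which is the point: Docker's seccomp profile is just this mechanism (in filter mode) applied to the container's init process and inherited by its children.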


The generic issue seems to be that stuff like containers can be escaped with pretty much any privilege escalation exploit ... and such exploits are reasonably common in the world of Linux.


So either take perf hit or don't expect isolation at all?


For completely untrusted workloads basically - yeah. For semi-trusted, there’s lots of tech that provides reasonable, lightweight isolation. There’s no reason why hardware vendors can’t ship products that are both virtualizable with high performance and secure, so that may still come.


But if the container is unprivileged?


I’ve been thinking about these problems for a while. Previously, I thought that the “put a VM on it” approach was the right one. In 2015, I wrote novm [1], which I think served as inspiration for some developments that followed. My thinking has changed over the years and I actually work on gVisor today (disclaimer!). I’d like to share some thoughts here.

Hypervisors never left. They are a fundamental building block for infrastructure and will continue to be.

The question is whether there will be a broad shift to start relying on hypervisors to isolate every individual application. In my opinion, just wrapping containers in VMs is not a solution. (Nor do I find it technologically interesting, but that’s me.) I agree that the approach addresses some of the challenges of isolation, but is one step forward, two steps back in other ways.

Virtualizing at the hardware boundary lets you do some things very well. For example, device state is simple, and hardware support lets you track dirty memory passively and efficiently, so you can implement live migration for virtual machines much better than you could for processes. It can divide big machines into fungible, commodity sizes (freeing applications from having to care about NUMA, etc.). It lets you pass through and assign hardware devices. It gives you a strong security baseline.

But abstractions work best when they are faithful. Virtual machines operate on virtual CPUs, memory and devices, and operating systems work best when those abstractions behave like the real thing. That is, CPUs and memory are mostly available, and hardware acts like hardware (works independently, interactions don’t stall).

Containers and applications operate on OS-level abstractions: threads, memory mappings, futexes, etc. These abstractions are the basis for container efficiency — not because startup time is fast, but because these abstractions allow for a lot of statistical multiplexing and over-subscription while still performing well. The abstractions provide a lot of visibility for the OS to make good choices with global information (e.g. informing the scheduler, reclaim policy, etc.).

A problem arises when you decide that you want to bind single applications to single VMs, and then run many VMs instead of many containers. Effectively, the abstractions that you expose are now CPUs and memory, and these just don’t work as well for over-subscription and overall infrastructure efficiency. There’s no shared scheduler or cooperative synchronization (e.g. in an OS, threads waking each other will be moved to the same core), there’s no shared page cache, etc.

There are other problems too: virtualization gives you a very strong security baseline, but you have to start punching significant holes to get the container semantics you want. E.g. the cited virtfs is a great example: it’s easy to reason about the state of a block device, but an effective FUSE api (and shared memory for metadata) is a much larger system surface. The hardware interface itself is not a silver bullet. Devices are still complex (escapes happen), and the last few years have taught us that even the hardware mechanisms can have flaws. For example, AFAIK Kata containers is still vulnerable to L1TF unless you’re using instance-exclusive cpusets or have disabled hyper-threading. (Whereas native processes and containers are not vulnerable to this particular bug.)

The “put a VM on it” approach also may not have the standard image problems that plain hypervisors have, but you’ve got portability challenges. It seems non-ideal that a container isolation solution can run in infrastructure X and Y, but not in standard public clouds or your on-prem VMWare hosts, etc. (There might be specific technologies for each case, but that’s rather the point.)

That’s my 2c. I’m pretty optimistic that we can have strong isolation while still preserving the efficiency, portability and features of container-based infrastructure. I like a lot of these projects (especially the ones doing technologically interesting things, e.g. nabla, x-containers, virtfs, etc.) but I don’t think the straight-up “put a VM on it” approach is going to get us there.


Hi, completely agree with all of this. In fact, we've been focusing on the problem you mention about needing FS holes for VMs to regain container semantics (https://www.usenix.org/system/files/hotstorage19-paper-kolle...). Just in case, these are some of the container semantics we care about: FS crash consistency, file sharing (write+write), and efficient use of memory due to having a single page cache. The key question is: what's the smallest hole we can poke (smaller than allowing every single FS operation in the host)?



