CoCore Network Architecture

The CoCore architecture consists of:

  • A User (e.g. you sitting in front of your computer wanting to know the result of 4 + 9).
  • A Control Server owned by CoCore.
  • One or more Worker servers that may have shared tenancy (e.g. Contoso Inc. isn't using all of the resources on their file server and leases the spare compute time to CoCore).

A successful request flows through the CoCore architecture roughly as follows:

```mermaid
sequenceDiagram
    User->>Control Server: Run Function A113 with arguments XYZ
    Control Server->>Worker: Run the Docker container for A113, arguments XYZ
    Worker->>Control Server: Started Execution B999
    Worker-->>Control Server: Stream Output Chunk 1
    Worker-->>Control Server: Stream Output Chunk 2
    Worker-->>Control Server: ...
    Worker->>Control Server: Execution B999 is Complete
    Control Server->>User: Execution B999 of A113 is Complete, output is [...]
```
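
To make the flow concrete, here is a minimal sketch of what each step might carry as a payload, using the 4 + 9 example from above (whose output would be 13). The field names and shapes are assumptions for illustration only, not CoCore's actual wire format:

```python
# Hypothetical payloads for each step in the sequence above; the field names are
# illustrative, not CoCore's real wire format.
run_request = {"function": "A113", "arguments": ["XYZ"]}            # User -> Control Server
dispatch    = {"execution": "B999", "function": "A113",
               "arguments": ["XYZ"]}                                 # Control Server -> Worker
started     = {"execution": "B999", "event": "started"}             # Worker -> Control Server
chunk       = {"execution": "B999", "event": "output",
               "stream": "stdout", "data": "13\n"}                   # streamed output chunk
complete    = {"execution": "B999", "event": "complete",
               "exit_code": 0}                                       # Worker -> Control Server
result      = {"execution": "B999", "function": "A113",
               "output": "13\n"}                                     # Control Server -> User
```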

This diagram introduces two new terms of art:

  • A Function is some code provided by the user, e.g. print 'Hello, world'
  • An Execution is a single run of that function, potentially with different arguments. A function will likely be executed many times.
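
As a minimal sketch, the two records might be modelled roughly like this (the field names are assumptions, not the actual CoCore schema):

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative data model only; the real CoCore schema is not specified here.
@dataclass
class Function:
    id: str                 # e.g. "A113"
    code: str               # e.g. "print('Hello, world')"

@dataclass
class Execution:
    id: str                 # e.g. "B999"
    function_id: str        # the Function being run
    arguments: list         # arguments for this particular run
    state: str = "open"     # see the state diagram below
    created_at: datetime = field(default_factory=datetime.utcnow)
```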

When a user requests that a function be run, the control server creates an execution record to track the invocation and assigns it to a worker. The worker then communicates with the control server over WebSockets to provide (a) a heartbeat indicating that the function is still running and (b) in-progress updates to stdout and stderr.
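
A rough sketch of what the worker side of that WebSocket traffic could look like, using the third-party `websockets` library; the endpoint URL and message shapes are assumptions for illustration:

```python
import asyncio
import json
import websockets  # third-party library; the URL and message shapes below are assumed

async def report(execution_id, output_chunks, control_url="wss://control.example/ws"):
    """Stream a heartbeat plus stdout/stderr updates for one execution (sketch only)."""
    async with websockets.connect(control_url) as ws:
        async def heartbeat():
            # (a) periodic signal that the function is still running
            while True:
                await ws.send(json.dumps({"execution": execution_id, "event": "heartbeat"}))
                await asyncio.sleep(5)

        hb = asyncio.create_task(heartbeat())
        try:
            # (b) in-progress output updates, e.g. ("stdout", "partial output...")
            async for stream, data in output_chunks:
                await ws.send(json.dumps({"execution": execution_id, "event": "output",
                                          "stream": stream, "data": data}))
            await ws.send(json.dumps({"execution": execution_id, "event": "complete"}))
        finally:
            hb.cancel()
```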

Once an execution is created, it can move between its states (as represented by the state column in the database) in the following ways:

```mermaid
flowchart LR
    open[Open] -->|Worker gets task| running
    running[Running] -->|Execution Success| success[Success]
    running -->|Execution Failure| failure[Failure]
    running -->|Execution Timeout| timeout[Timeout]
    running -->|User Cancels| cancelled[Cancelled]
```

All of the transitions are driven by the worker notifying the control server of the change, except for user cancellations, where the control server has to tell the worker to cancel the task.
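
One way to make those rules executable, with the state names taken from the diagram (the helper itself is just a sketch, not CoCore's code):

```python
from enum import Enum

class State(str, Enum):
    OPEN = "open"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"
    CANCELLED = "cancelled"

# Allowed transitions from the flowchart above. The worker reports all of them,
# except RUNNING -> CANCELLED, which the control server initiates by telling the
# worker to cancel the task.
TRANSITIONS = {
    State.OPEN:    {State.RUNNING},
    State.RUNNING: {State.SUCCESS, State.FAILURE, State.TIMEOUT, State.CANCELLED},
}

def advance(current: State, new: State) -> State:
    """Validate a state change before writing it to the execution's state column."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```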

CoCore Worker Architecture

A CoCore Host Server is a third-party-owned machine that leases compute time to CoCore. In the current architecture, the CoCore Daemon is installed on the host machine. The daemon starts a Firecracker VM whose init script establishes a WebSocket connection to the control server and listens for tasks; when it receives a task, it runs it with bash or Docker.

```mermaid
flowchart LR
  subgraph Host Server
    daemon[CoCore Daemon] -->|Starts µVM| firecracker
    subgraph firecracker[Firecracker/Jailer]
        manager[In-VM Manager]
        manager -->|Starts| bash[bash -c task]
        manager -->|Starts| dockertask
        subgraph docker[Docker Container]
            dockertask[RUN task]
        end
    end
  end
  control[Control Server] --- |WebSocket Connection| manager
```
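
A simplified sketch of what the In-VM Manager's task loop might look like under this architecture; the task shape and commands are assumptions for illustration:

```python
import json
import subprocess
import websockets  # assumed: the manager keeps a WebSocket open to the control server

async def task_loop(control_url):
    async with websockets.connect(control_url) as ws:
        async for raw in ws:                       # wait for tasks from the control server
            task = json.loads(raw)
            if task.get("image"):                  # containerised task -> run via Docker
                cmd = ["docker", "run", "--rm", task["image"], *task.get("args", [])]
            else:                                  # plain task -> run via bash
                cmd = ["bash", "-c", task["command"]]
            result = subprocess.run(cmd, capture_output=True, text=True)  # blocking, for brevity
            await ws.send(json.dumps({"execution": task["execution"],
                                      "event": "complete",
                                      "exit_code": result.returncode,
                                      "stdout": result.stdout,
                                      "stderr": result.stderr}))
```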

The isolation provided by Firecracker (and Docker) generally only protects what is outside the container from what is inside it, not the other way around (don't get me started on security; I have somewhere else I'm going with this). Because the In-VM Manager lives inside the same boundary as the tasks it runs, it may be possible for executing tasks to access (and potentially alter) the In-VM Manager and other executing tasks -- not great for a Lambda-like service!

We also do not take advantage of the main benefit of Firecracker with this architecture: the fast startup time that comes from its minimal device emulation. I believe we may want to consider an architecture more similar to AWS Lambda's, which looks like this:

```mermaid
flowchart LR
  control[Control Server] --- |WebSocket Connection| daemon
  subgraph Host Server
    daemon[CoCore Daemon and Manager] -->|Starts µVM| firecracker1
    daemon[CoCore Daemon and Manager] -->|Starts µVM| firecracker2
    subgraph firecracker1[Firecracker/Jailer]
        subgraph docker1[Docker Container]
            dockertask1[RUN task]
        end
    end
    subgraph firecracker2[Firecracker/Jailer]
        subgraph docker2[Docker Container]
            dockertask2[RUN task]
        end
    end
  end
```

(supposedly it looks like this; Amazon doesn't share the implementation details)

In this case, the CoCore Daemon and task manager run on the host server with no isolation; however, since the daemon code is managed by CoCore, we can ensure that it is well-behaved. The potentially ill-behaved customer code is isolated within the Firecracker/Jailer microVMs, as well as the Docker container within each microVM, and cannot touch the CoCore manager that started it or any other running tasks.
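
As a hedged sketch of the host side under this layout, the daemon would launch one Firecracker/Jailer microVM per execution; the socket naming and config path below are assumptions, and the exact flags should be checked against the Firecracker documentation:

```python
import subprocess

def start_microvm(execution_id: str, vm_config_path: str) -> subprocess.Popen:
    """Launch one isolated Firecracker microVM for a single execution (sketch only).

    The daemon (trusted CoCore code) stays on the host; the customer's task only
    runs inside this microVM, so it cannot reach the daemon or other tasks.
    """
    api_sock = f"/tmp/cocore-{execution_id}.sock"      # per-VM API socket (illustrative path)
    return subprocess.Popen([
        "firecracker",
        "--api-sock", api_sock,
        "--config-file", vm_config_path,               # kernel, rootfs, and boot args for this VM
    ])
```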
