Thursday, August 29, 2013

Distributed server: any DHT will do, right? Wrong.

After diving into DHTs a while ago, i first thought i had it all figured out: DHT is the name of the game when it comes to distributed servers, or, at the very least, they are an appropriate and mature solution for providing a distributed routing service. And apparently that is indeed the case, but with a caveat: all the common DHT algorithms presented in the literature are highly unreliable in a high-churn rate [residential] end-user-supported P2P network. More specifically, what all common DHT algorithms (that i know of) lack is on one hand enough redundancy to cope with the kind of churn typically found in end-user P2P networks (where users frequently join and leave he network, unlike in a network of long-lived servers), and on the other hand they are not sufficiently resilient to face the kinds of concerted attacks that can be perpetrated in a P2P network by a set of coordinated malicious nodes.

To make the long story short, the conclusion for all this was that building the P2P OS distributed server by simply copy-pasting an existing DHT algorithm is a no-go, and this sent me right back to square one: "now what?"

Well, the breaking news story-of-the-day is that i think i found a way to strengthen DHTs just enough to make them cope with the high churn problem, and, together with the obfuscated code-based "moving target defense" mechanism, i might now have a complete solution to almost all the potential problems i can foresee at this stage (specifically, there is one more problem that i'm aware of that is still outstanding, namely protecting against DDoS attacks, but apparently there are accessible commercial solutions for this one also; i'll talk about this in another post after i'll do some more digging)

Without getting into too many technical details at this point (primarily because all this is still in a preliminary stage, without a single line of code being written to actually test the algorithms involved), the main ideas for an "improved DHT" are as follows:
  • use a "network supervisor" server which, based on its unique global perspective over the network, will be responsible for maintaining a deterministic network topology, all while also keeping the network's critical parameters within acceptable bounds
  • add redundancy at the network nodes level by clustering several routers inside each node: in brief, having several routers inside a node, coupled with a deterministic routing algorithm (as enabled by the deterministic topology of the network), should provide a sufficient level of resilience to malicious intruders such as to allow the network to operate properly
Sure enough, the points listed above are just the very top-level adjustments that i'm trying to make to the existing plain-vanilla DHTs, but there are quite a lot of fine points that need to be actually implemented and tested before celebrating, e.g. the iterative routing algorithm with progress monitor at each step in the routing process, having multiple paths from one node to another supported by a backtracking algorithm, node state monitoring and maintenance by the supervisor server, etc - and these are just a few examples of the issues that i am aware of.

At the end of the day, when all pieces are put together the overall picture looks something like this:

So basically this is how far i got: i have this "supervised network" architecture which i think might be a solution for a sufficiently resilient and reliable distributed server, and i have the code obfuscation-based network integrity protection, but now i need to test these thingies the best i can. I definitely won't be able to test a large-scale system anywhere near a real-life scenario until actually deploying it in the wild, but a preliminary validation of its key features taken one by one seems feasible.

The network monitoring/maintenance algorithm, the node insertion/removal procedures, etc, are all pretty messy stuff that i still have to thoroughly double-check before actually diving into writing code -- e.g. here's a sneak preview for how a new node is inserted in, and announces its presence to, the routing ring:

  • the blue nodes are "currently" existing nodes positioned in an already-full 23-node ring (i.e. 000::, 001::, 010::, 011::, 100::,, 101::, 110::, 111:: in the image above, where '::' means all trailing bits are 0)
  • the yellow nodes encircled in solid lines are nodes that have already been inserted in the yet-incomplete 24-node ring (the yellow nodes are interleaved with the existing 23 blue nodes in order to create the new 24-node ring)
  • the red node is the node that is "currently" being inserted in the routing ring (more specifically, in the yellow nodes "sub-ring" at index 0111::, i.e. in between the [already existing] blue nodes 011:: and 100::)
  • the yellow nodes encircled in dashed lines are nodes that will be inserted in the [yet-incomplete] yellow nodes ring after the "current" insertion of the red node is completed
  • after the yellow sub-ring will be completely populated (i.e. there will be a total of 24 [yellow and blue] nodes in the routing ring), the routing ring will be expanded to 25 nodes by inserting new nodes in between the existing [yellow and blue] nodes of the 24-node ring, a.s.o.; i.e. the routing ring always grows by creating a new sub-ring of "empty slots" in between the existing nodes, and incrementally populating said empty slots with new nodes

Thursday, August 1, 2013

Been stuck for several months, but now i might be on to something

As i explained in an earlier post, there are several classes of internet connection that a user may have in the real world, but for the purpose of this discussion we shall simplify the categorization in only two [top-level] "meta-classes":
  • 'good' internet connections: these connections allow a peer to have direct P2P connectivity with any other peer on the network; and
  • 'leech' internet connections: these connections only allow two peers to connect to each other by means of a relaying peer, where said relaying peer must have a 'good' connection in order to be able to act as a relay
As it can be seen, any two peers with 'leech' connections will have to rely on a third-party relaying peer with a 'good' connection in order to be able to connect to each other.

In other words, there are real-world objective reasons that will prevent all peers from being equal on the network: 'leeches' will always require assistance from 'good' peers, while they will be truly unable to assist other peers on the network in any way (because of their objectively problematic internet connection)

The problem (that got me stuck for over four months):
In the real-world internet, the ratio between 'good' internet connections and 'leech' connections is (by far) sufficiently high to enable a cooperative self-sustained P2P network, i.e. there are enough 'good' peers that can provide relaying services to the 'leeches' upon request. HOWEVER, the very fact that there is a network contribution disparity between 'good' peers and 'leeches' can motivate some users to commit severe abuses that can ultimately bring down the network (if too many users become abusive): namely, a peer with 'good' connectivity might just decide it doesn't want to serve the network (by providing [bandwidth-consuming] relaying services to the unfortunate 'leeches'), and in order to get away with this unfair behavior all it has to do is to misrepresent its 'good' internet connection as being a 'leech' connection: once successful in misrepresented itself on the network as 'leech', it will not be requested to provide [relaying] services on the network.

So the problem can now be stated as follows:
how can an open-protocol P2P network be protected against hacked malicious clients which, because the network protocol is open, can be crafted in such a way that they will fully obey the network protocol syntax (and thus will be indistinguishable from genuine clients based solely on their behavior), but they will falsely claim to have 'leech'-type of internet connections that prevent them from actively contributing to the network. In brief, said malicious clients will unfairly use other peers' bandwidth when they'll need it, but will not provide [any] bandwidth of their own to the other peers when they'll be requested to do so, and they will get away with it by falsely claiming that they are sitting behind a problematic type of internet connection which prevents them from being cooperative contributors to the network (when in truth they are purposefully misrepresenting their internet connection's capabilities in order to make unfair use of the network).

The standard solution (which cannot be used):
The standard solution to the problem described above is to make sure that all the peers in the network are running a digitally-signed client program, which client program is a known-good version that a central authority distributes to the peers. However, once we dive into the details of how such a solution can be implemented we get into trouble: specifically, digitally-signed clients cannot be used in the P2P OS ecosystem because this would imply the existence of an [uncompromised] signature-validation DRM running on the peers' computers, which we cannot assume, because if we would make such an assumption we would only shift the problem of “how do we prevent compromised peers” to “how do we prevent compromised DRMs”, i.e. we'd only get right back to square one

A saving idea? (go or no-go, not sure yet):
A new way of protecting a known-good system configuration is the talk of the town these days, namely the "moving target defense" (a.k.a. MTD) [class of] solutions (apparently this concept - as opposed to the underlying techniques - is so new that it didn't even make it in wikipedia at the time i'm writing this), and for the specific case of the P2P network problem as i stated it above (i.e. resilience to maliciously crafted lying peers) the MTD translates into the following:
  1. have a central authority that periodically changes the communication protocol's syntax, then creates a new version of the client program which complies with the new protocol, and finally it broadcasts the new [known-good] version of the client program on the P2P network; in this way, the protocol change will immediately prevent ALL old clients, including the compromised ones, to log onto the network, and will require each peer to get the new [known-good] version of the client program as distributed by the central authority (i.e. all the maliciously-crafted compromised clients are effectively eliminated from the network immediately after each protocol change)
  2. the protocol changes that are implemented in each new version of the client program will be deeply OBFUSCATED in the client program object code (using all the code obfuscation tricks in book), with the goal of delaying any [theoretically possible] successful reverse engineering of the new protocol beyond the release of the next protocol update and thus render the [potentially cracked] older protocol(s) unusable on the network 
  3. the protocol obfuscator must be automatic and must itself be an open source program, where the only secret component (upon which the entire system security scheme relies on) must be the specific [random] strategy that the obfuscator elects to use as it releases each new version of obfuscated clients
As a result, after each protocol update the P2P network will only host known-good versions of clients, and by the time when any protocol reverse engineering effort might be successful, a new protocol update will already have been released, thus preventing any prior-to-the-update reverse-engineered clients to log onto the network.

The work ahead:
As it can be seen from the above description, the dynamic protocol update solution relies on the ability to create and distribute obfuscated program versions at a higher rate than an attacker's ability to create a malicious reverse engineered version of the program. Thus, given a system that uses the dynamic protocol adjustment method (as described above), the network integrity protection problem translates into the following problem:
[how] can a protocol be obfuscated such that the [theoretical] time necessary to crack the obfuscated code, given a known set of resources, exceeds a predefined limit?
Should the protocol obfuscation problem have a solution (probably underpinned by dynamic code obfuscation techniques) then the problem is solved (and i won't mind if it will be an empirical solution for as long as it proves viable in the real world) - so this is what i'm trying to find out now.

A few articles on/related to code and protocol obfuscation:

I also started a discussion on code obfuscation on comp.compilers, feel free to join here:!topic/comp.compilers/ozGK36DRtw8