Service Oriented Architectures (SOAs) have been the meat and potatoes of enterprise solutions for over a decade. Their influence on software engineering methodology has been unmistakable, although applying those methodologies to the development of newer distributed processing clusters can result in painful mismatches. This article details the differences between the two architectures, and describes process changes that could be implemented in order to account for those differences.
An SOA’s workflow involves receiving a message from a network, doing something with it, and then either passing the results back to the caller or passing them on to the next service in line. The typical request is no more than 2 KB in size, whereas the executable that handles it starts out at around a gigabyte, growing with containerization, a full server image, and code bloat.
By definition, the primary use case for Big Data (BD) involves data sets that are too large to fit on a single server. This puts the size of the data that passes through a single execution at 64 GB or larger. The interfaces for input, output, and configuration management are part of the infrastructure, so the executable tends to be smaller than an equivalent SOA executable.
The shift in data size identifies the primary difference: for an SOA, the data is much smaller than the executable, so an efficient process sends the data to the executables. In BD, the data is much larger than the executable, so we send the executable to where the data is. Every BD implementation sends its executables to the nodes where the data is stored, a trait referred to as data locality.
Data locality requires that the organization have a continuously allocated set of resources for data storage, and those storage nodes must have capacity to process the data. In a BD cluster, the nodes will usually pack both massive storage and significant data processing capability, i.e. lots of disks, cores, and RAM.
Data locality is the key to understanding the difference in how the two architectures scale. With an SOA, you can always spin up new nodes and alter your load balancing rules to include new resources. SOA clusters are often designed to adjust the number of virtual machines assigned to a specific task based on the current load. This is one of the appealing traits of cloud hosting — you only pay for the processors you use. With BD, spinning up new hosts results in hosts with no data on them, which ruins data locality.
Fortunately, BD jobs aren’t scaled across more processors as their load increases; they’re scaled across more time. This brings up a second major difference: the time domain. An SOA handles every packet as it comes in, providing response times measured in tenths of a second; there is often a human sitting in front of a screen waiting impatiently for it to respond. With BD, an end result is rarely expected to be available in less than half an hour.
This is the nature of batch processing in a BD cluster like Hadoop. Instead of waiting like a cat to pounce upon every event, a BD cluster lets events build up and runs them all through at once, thereby gaining efficiencies of scale. When a period’s data gets larger, it just takes a little longer to process. As long as you can process each time segment’s work in about half of that time period, you can run an efficient process with room for surges.
Batch processing and data locality combine to fundamentally change how instances of an executable are provisioned. While a dynamically scalable SOA will check the load every few minutes and spin up or tear down nodes as needed, a BD cluster will create hundreds or thousands of instances of an executable in less than a minute, run all the data through them, and then free up those instances when the batch is complete.
If a Hadoop cluster tried to spin up a VM for every instance, provision an OS, confirm functionality, load the server, and then run it, the overhead would make BD impractical. Instead, a Hadoop cluster uses the resources required by data locality to set up a standard set of services that all executables can and should use. As mentioned previously, this mostly involves file and configuration access, but it also involves a system that tracks the availability of processors and RAM so that it can decide which machine in the cluster has capacity for each load.
You could think of a BD system in terms of a Gatling gun. The multiple spinning barrels all share a single feeding and firing mechanism, as opposed to each barrel having its own. Similarly, a BD executable only needs to receive the data, process it, and then pass the results back to the subsystem, without having to worry about details like IO and process management. This specialization results in smaller, simpler executables.
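To make that concrete, here is a minimal sketch of what such an executable looks like as a Hadoop map task, assuming a tab-delimited text input; the class name and the field handling are illustrative rather than taken from any real pipeline. Notice that there is no socket handling, no threading, and no file management in the job code itself.

```java
// A minimal sketch of a Hadoop mapper: the framework handles reading input
// splits, shipping this class to data-local nodes, and collecting output.
// The job logic itself is just "receive a record, process it, emit a result".
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EventCleanupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical business logic: normalize a tab-delimited event record.
        String[] fields = line.toString().split("\t");
        if (fields.length < 2) {
            return; // bad record; a real job would usually count these
        }
        context.write(new Text(fields[0]), new Text(fields[1].trim().toLowerCase()));
    }
}
```

The framework decides which node runs this class, feeds it records from a local block, and collects whatever it writes to the context.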
The flip side of this is that while you can easily replace an entire SOA cluster mid-flight by replacing the instances one at a time, replacing a BD cluster is a painful, time-consuming process. You can replace the jobs that run on it easily, but the cluster that they run on is an infrastructure consideration.
One thing that a BD infrastructure doesn’t provide is incoming network connections. A hard rule of BD clusters is that a BD instance is never a service. It’s always a client, and usually receives all of its input from, and sends all of its output to, a distributed file system like HDFS, a streaming service like Kafka, or a KV store like Cassandra.
Since BD clusters create hundreds of instances that only live for a short period of time, sending data into them would require a load balancing scheme that allocated all of the addresses and ports and then advertised this new location to whoever wanted to use the services. That immediately raises the question of “why aren’t we building an SOA instead?”, and the answer is usually to build exactly that.
Accessing a relational database on a per-event basis from a BD process is a big no-no because it’s an easy way to overwhelm your database. The preferred method of dealing with this is to cache all the values you’re going to need before you start processing the data. On the update side, you want to aggregate any output for an RDB to a file where it can be fed in at a more sedate pace.
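Here is a hedged sketch of that “cache first, aggregate later” pattern, assuming the reference data has already been exported from the database to a file on HDFS before the job starts; the property name reference.data.path and the record layout are hypothetical.

```java
// Load the whole lookup table once per task instead of querying per event;
// results go back to the file system, where a later, slower-paced step can
// feed the aggregated output into the relational database.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnrichmentMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> referenceCache = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "reference.data.path" is a hypothetical job property set by the driver.
        Path refPath = new Path(context.getConfiguration().get("reference.data.path"));
        FileSystem fs = refPath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(refPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) {
                    referenceCache.put(kv[0], kv[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        // No per-event JDBC call: the lookup is served from the in-memory cache.
        String enrichment = referenceCache.getOrDefault(fields[0], "UNKNOWN");
        context.write(new Text(fields[0]), new Text(enrichment));
    }
}
```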
I’d like to highlight the implications of this difference. A large SOA infrastructure is kind of like a wild-west shootout, where every service has the ability to communicate with every other service. I’ve seen many performance-destroying schemes for keeping it all straight. In my experience, the only way to ensure that an uncertain process isn’t going to interfere with this Olympic juggling competition is to run the SOA cluster in complete isolation from the uncertain processes, with firewall rules preventing tests from accessing services that you don’t want them touching. In any industrial development process, this means having a production environment in which anything that runs must first have passed through tests in several other environments before you let it in, kind of like an exclusive golf club. The productivity hit of such a development process is a necessary evil.
With a BD cluster, any job has a defined input and a defined output. They can’t receive spontaneous input from other processes, and they can’t write to the file system anywhere that they don’t have permissions. This means that unless you’re breaking some rule like hard-coding your database credentials, you can run a BD process as a different user, and file access rules will perform the same function as a hard firewall.
In SOA terms, every BD job is like having a testing facility that will spin up an entire isolated cluster, instantiate all of the servers, run all of the processes, and then clean up everything except the results when it’s done. I’ve seen development teams work for years to get something like that running for SOAs by using Vagrant on VMs, and still need to coach every developer through their first few runs of it.
Experienced SOA architects will consider a Hadoop cluster to be an “environment,” and require numerous preproduction Hadoop clusters in order to satisfy an SOA’s need for isolation. As mentioned above, they consider the hardware and productivity cost to be a necessary evil. In reality, a Hadoop cluster is more like a data center, in which many environments can be housed. Each user identity embodies an environment, and can have its own isolated sources and targets.
This allows for concepts like A/B testing of pipelines. You can usually run two versions of a process on the same starting data, in the same cluster, simultaneously, without being concerned about interference, as long as you specify different output locations. If this thought horrifies you, then this article is for you.
The output of your Production A and Production B jobs can be compared to validate correct functionality. It also allows you to run a process under full load using real production data so that you can identify edge cases that would crash your process, without endangering your production pipeline. In a BD process, bad data and incorrectly processed data are a far greater threat to reliable processing than inter-process interference.
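As a rough illustration of that workflow, the sketch below submits two runs over the same read-only input, each writing to its own output directory. The paths are illustrative, and both runs reuse the EventCleanupMapper sketch from earlier purely to keep the example self-contained; a real comparison would point run B at the candidate version of the job.

```java
// Two concurrent runs over the same input, isolated only by their output paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AbComparisonDriver {

    public static void main(String[] args) throws Exception {
        Path input = new Path("/data/events/current");   // shared, read-only input

        Job runA = buildJob("pipeline-A", input, new Path("/user/prod/output/current"));
        Job runB = buildJob("pipeline-B", input, new Path("/user/candidate/output/current"));

        // Both runs can be in flight at the same time; their outputs never collide.
        runA.submit();
        runB.submit();
        while (!(runA.isComplete() && runB.isComplete())) {
            Thread.sleep(5000);
        }
        System.out.println("A succeeded: " + runA.isSuccessful()
                + ", B succeeded: " + runB.isSuccessful());
        // The two output directories can now be diffed to validate the candidate.
    }

    private static Job buildJob(String name, Path input, Path output) throws Exception {
        Job job = Job.getInstance(new Configuration(), name);
        job.setJarByClass(AbComparisonDriver.class);
        job.setMapperClass(EventCleanupMapper.class);   // run B would use the candidate mapper
        job.setNumReduceTasks(0);                       // map-only, for simplicity
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        return job;
    }
}
```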
The purpose of containerization is to have a well-defined environment in which an executable runs. If you run only one service per piece of hardware, you often waste a large chunk of your hardware. If you run multiple services on each, then their configuration, libraries, and environmental variables can interfere with each other.
Virtual machines are an industry standard for this reason, even though they add a layer of complexity and create a new opportunity for failure. VMs also act as a processing tax, taking a percentage of your processing power, although being able to build a system of exactly the right size for your needs makes that mostly irrelevant. Lighter-weight solutions like Docker alleviate most of the hardware requirements, but still add a layer of complexity (albeit a much smaller one).
The previous section should have made it clear that attempting to distribute a VM as a container for a mapreduce job would be a bad idea. It’s tempting to think that Docker containers would work, but this also runs into problems.
Docker was designed with SOAs in mind. Containers are built to be long-running, and the service that coordinates and maintains them isn’t designed to create dozens of containers simultaneously and then clean some of them up before others have even started running. When my team attempted to use Docker containers for mapreduce tasks, cleaning up old, dead containers became a regular maintenance task. Additionally, the jobs would sometimes fail because the Docker network proxy didn’t play well with Ubuntu’s firewall.
When mapreduce jobs are initiated on a Hadoop v2 cluster, the default coordinator, YARN, actually performs a type of containerization. It provides the job with a full configuration that should hold all of the information the job would normally receive from configuration files, environmental variables, and other configuration sources. A mapreduce process should never have to look outside of that configuration for any of its operating information.
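A short sketch of what that looks like from inside a task, with hypothetical property names: everything the job needs is read from the Configuration object that YARN hands it, rather than from environment variables or local files.

```java
// Operating information comes from the job Configuration, not from the node.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigDrivenMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String region;

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        // "pipeline.region" is an illustrative property placed into the
        // configuration by the driver (or by site defaults) before submission.
        region = conf.get("pipeline.region", "us-east");
        // Reaching for System.getenv() here would tie the job to one node's setup.
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains(region)) {
            context.write(new Text(region), line);
        }
    }
}
```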
Java library coordination is a different problem. Since you never know which node an executable will run on, it’s entirely possible to have two processes that require different versions of the same library to be running at the same time. Hadoop handles this by preferring shaded jars that contain all of the libraries that the process will need in the jar itself.
Abstraction is an excellent tool for software engineers. So much so, in fact, that it is a common anti-pattern for us to apply multiple layers of it. My favorite example of this is a Spring builder factory, which provides three layers of execution abstraction. Putting a container inside of a container is another example, one that exists largely because a developer doesn’t understand one form of abstraction and so applies one that they do understand around it.
Firewalls and Networks
The nature of how information is shared within a Hadoop cluster also has implications for how your network should be structured, both logically and physically. With an SOA, you are encouraged to specify which services are listening on which ports at which IP addresses, and administrators are aggressive about blocking ports and IP addresses that don’t correspond to the services that a specific service will want to access.
If you were to map out which nodes of a Hadoop cluster should be able to communicate with which other nodes, you would find that there are very few ports that can be blocked. While a BD instance should never be a server, there are many kinds of services that make BD possible. Each node runs a service to provide access to its storage resources and a service to provide access to its processing resources. Under YARN, every job is started by running a process called an Application Master, which then spins up child processes on data-local nodes to do the actual processing. The Application Master claims an arbitrary high-numbered port on its machine as its own, and all of its child processes need to connect to it in order to report their status.
It’s the natural inclination of an IT professional to lock down a Hadoop cluster until it no longer works. It’s easy to do, and sometimes you get “lucky” and don’t notice that it’s happened for a few days. A better approach is to give each cluster its own subnet, and to allow only designated edge nodes, along with hosts that have administration duties, to communicate with anything outside of that subnet. Given the inability to block inter-node traffic, this ring of fire is critical for security-conscious administrators.
The recommended network configuration echoes this pattern. The common SOA idiom is to have everything connected by a single massive backplane and to use firewall rules to isolate functionality. The standard practice for Hadoop clusters is to give them their own backplane, with limited connectivity to the outside world, so that they won’t flood the network on occasions when data locality isn’t available.
For me, one of the most surprising aspects of BD development was that you don’t need to worry about multithreading issues. The areas where most systems take advantage of multithreading are wait-states, caused by file system, database, or network access. The subsystems will perform IO caching on both the input and output side, and RDB caching eliminates database delays. The client BD executable simply needs to wait for input, process it, and then pass it back.
There are many forms of input that require events to be read in the order in which they were written, so this is the default behavior for file-based input. Scattering the events across several threads would break that ordering. Instead, the subsystem will allocate one process per file, or one process per block of a file, and let a single thread tear through it without interruption. Parallelism is achieved by creating more instances, which are allocated according to the available hyper-threaded cores.
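When a job needs to guarantee the one-process-per-file behavior explicitly, one way to pin it down (sketched below; the class name is illustrative) is to extend the stock TextInputFormat and mark files as non-splittable, so that each file is handed, from start to finish, to a single single-threaded task.

```java
// Declare the input non-splittable: exactly one map task reads each file,
// preserving the order in which its events were written.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class OrderedEventInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file across tasks
    }
}
```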
The biggest takeaway from my experience is that the base processing of big data is actually very simple. It receives input from one side and writes it out the other, and the logic an instance performs in between is easily testable with simple unit tests.
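For example, if the record-level logic is factored into a plain method, it can be exercised with an ordinary unit test and no cluster at all; the names below are illustrative.

```java
// A plain JUnit test of the in-between logic, with no Hadoop dependencies.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class EventCleanupLogicTest {

    // The same normalization the mapper applies, factored into a pure function.
    static String normalize(String rawValue) {
        return rawValue.trim().toLowerCase();
    }

    @Test
    public void trimsAndLowercasesValues() {
        assertEquals("widget-42", normalize("  Widget-42 "));
    }
}
```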
The difficulty comes from making sure that an executable accesses the subsystems correctly. Connecting to the file system, loading the compression codecs, and accessing databases for configuration and caching are where most of the problems are experienced. These are not concerns that you can test in isolation. In practice, the smallest system on which you can effectively test BD code is a full BD cluster. Even if you test BD code on a test cluster, you still have no guarantee that it’s going to work the same way on a production cluster. The configuration of the full ecosystem is too complex to guarantee that you’re reproducing it exactly the same way every time. Configuration systems like Puppet, Chef, and Ansible do little to avoid this problem because you can’t tear down and re-initialize the entire cluster every time you adjust the cluster’s tuning.
Fortunately, with the way BD executables run, you can guarantee isolation of a single run. This means that you can make the final test run of an executable on the same cluster in which your production executables run without breaking your production pipeline. In fact, because of the environmental complexity issues, you will always wind up testing a BD executable in production — it’s just a question of whether or not you do so in the middle of its release.