How To Scale Big Data Environments

Big Data Environments

The rapid pace of improvement in technologies like artificial intelligence, machine learning and the Internet of Things has had a great impact on the tech industry. This can be felt in the form of better predictive applications, better natural language processing and, of course, self-driving cars.

While most companies in the world won’t find themselves dealing with such complex use cases, we have to contend with at least one simple fact: the amount of data we produce is growing. Statista forecasts that the big data industry will have grown to a whopping $64 billion by 2021.

Big Data presents interesting opportunities for new and existing companies, but it also poses one major problem: how to scale effectively.


Scaling for Big Data is Difficult

By definition, Big Data is unique in its sheer scale and size. When dealing with it, you are responsible for incredibly complex systems that need constant care and maintenance, and in the course of processing such data, numerous issues are bound to arise.

Dealing with big data has inherent complexities, such as:

•  Collecting data is expensive, and methods of collection are limited.

•  Sources of data are wide and varied.

•  Tools for collecting data exist, but they are usually specialized. The kind of input they accept and output they produce is very specific.

•  You need the right tools to transform the collected data into something useful.

•  Processed data has to be sent to multiple destinations.

•  Security concerns: if the post-2010 era is any indication, security breaches have been a mainstay in the tech industry. These need to be addressed before a project is moved over to production.

•  Complying with local and international regulations.

Last of all are the performance bottlenecks projects face when attempting to keep up with new requirements. As a big data project grows, these issues inevitably become more widespread and more common if the right steps are not taken to mitigate them. They can then translate into a myriad of legal and financial problems if not addressed.

These problems can be broadly divided into two categories, each calling for a unique solution: software issues, where traditional software is no longer efficient and needs to be changed or upgraded; and hardware limitations, which can be solved through vertical scaling (scaling up) and horizontal scaling (scaling out).


The Software Problem: Increasing Query Complexity

Traditional ways of dealing with large sets of data are either via spreadsheets or through relational databases. The former is dead in the water before an argument can even be made for it, which leaves us with relational databases, mostly based on SQL.

To claim big data is too big for conventional databases would be a rather bold statement, and difficult to defend. This is mostly because of the hazy definition of what big data exactly is.

Your data doesn’t suddenly transform from ‘small data’ to ‘big data’ at a fixed point in time. A better argument to make would be that not even the best-performing RDBMSs have the capacity to handle terabytes of data at a time, and do so efficiently.

Instead, software made specifically for big data problems should be adopted. Examples include Hadoop, Spark and Hive. How these systems are integrated into current systems depends on the requirements of the business, but they can either be clustered or used in a single monolithic system with more data added on top.
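The split-process-merge model underlying Hadoop and Spark can be sketched in miniature in plain Python. The corpus and partitioning below are hypothetical; in a real cluster, each "map" would run on a separate node:

```python
from collections import Counter
from functools import reduce

# Hypothetical corpus, standing in for a large dataset split into partitions.
partitions = [
    "big data needs big tools",
    "tools for big data",
]

# Map step: each partition is counted independently (in Hadoop/Spark
# this work would run in parallel across cluster nodes).
mapped = [Counter(part.split()) for part in partitions]

# Reduce step: partial counts from every node are merged into one result.
totals = reduce(lambda a, b: a + b, mapped)

print(totals["big"])  # 3
```

The point of the pattern is that no single machine ever has to hold or scan the whole dataset, which is what makes these systems scale where a single RDBMS struggles.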

If processing data were as simple as converting everything into easy-to-digest JSON files, big data would be a lot simpler. The basic truth, though, is that data needs to be cleaned and converted into different formats to accommodate different clients.
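The clean-and-convert step described above can be illustrated with a minimal sketch: raw CSV is cleaned, typed, and re-emitted as JSON for one hypothetical client (the field names and data here are invented for illustration):

```python
import csv
import io
import json

# Hypothetical raw export; real pipelines would read this from files or streams.
raw = "id,amount\n1, 19.99 \n2,5.00\n"

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    # Cleaning: strip stray whitespace and coerce types before hand-off.
    cleaned.append({"id": int(row["id"]), "amount": float(row["amount"].strip())})

# Conversion: emit JSON for one client; others might need Parquet, Avro, etc.
payload = json.dumps(cleaned)
print(payload)
```

Multiply this by dozens of sources and output formats and the appeal of dedicated transformation tooling becomes obvious.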

Further, some vendors prefer to deal with the problem simply by offering object stores: the data isn’t converted into any special format for storage; it’s kept as is. Most relational databases offer blob storage, but it’s just not as efficient as, say, Hadoop’s HDFS.

For data scientists that prefer to keep up with the times, data warehouses such as Google’s BigQuery or Amazon’s Redshift may hold more appeal. BigQuery, for instance, is a fully-managed solution, dubbed ‘big data as a service’ from Google.

All the complexity of running a dozen different clusters is taken away, as all you have to do is communicate with the API. Google does all the upgrading, checking for downtime and optimizing for speed on your behalf. The best part? None of it is your hardware.

This admittedly addresses hardware concerns also, since you don’t have to worry about dealing with multiple connected servers for your Hadoop network, for instance.

Lastly, it’s important to remember that data warehouses are distinctly different from relational databases by design. BigQuery can be used like a transactional database, but it will be incredibly slow. Rather, it’s meant for the analysis work, and the results can then be stored in your favorite SQL (or NoSQL) database.
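That division of labor, running heavy analysis in the warehouse and keeping only the small result set in an ordinary SQL database, can be sketched with SQLite standing in for that database. The aggregated rows here are hypothetical, as if they had just come back from a warehouse query:

```python
import sqlite3

# Hypothetical aggregates, as returned by a warehouse query such as BigQuery.
results = [("2020-01", 1200), ("2020-02", 1350)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, total INTEGER)")
conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)", results)
conn.commit()

# Downstream applications now read the small result table, not the warehouse.
total = conn.execute("SELECT SUM(total) FROM monthly_revenue").fetchone()[0]
print(total)  # 2550
```

Serving dashboards and applications from the small, fast result table keeps the expensive warehouse queries out of the request path.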


The Hardware Problem: Horizontal And Vertical Scaling

The second class of bottlenecks companies will normally face is general hardware limitations. These problems, when encountered, will negatively affect your organization by reducing efficiency, disrupting your workflow and lowering customer satisfaction.

It’s important to remember that even after making software upgrades, hardware problems are going to persist. As a matter of fact, new applications call for upgrading hardware almost by default. Transactional databases by themselves can be very memory hungry; however, the amount of resources they need is comparatively modest when stacked up against Hadoop or Spark.

The most common performance bottlenecks that will be experienced are as follows:

CPU bottlenecks: This is the most common bottleneck in expanding big data projects, and usually the easiest to detect. A CPU bottleneck occurs when more demands are made of the CPU than it can keep up with.

Slowing performance is the key indicator of CPU usage that’s through the roof, and it will often result in dozens of issues. For instance, one component in your tech stack could fail while the rest keeps working, albeit more slowly than usual.
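A rough first check for this condition is comparing the system load average against the core count, which Python exposes directly. This is a heuristic sketch, not a monitoring solution, and `os.getloadavg()` is Unix-only:

```python
import os

# A load average persistently above the core count is a rough sign
# that more work is being demanded of the CPU than it can keep up with.
cores = os.cpu_count()
load_1min, _, _ = os.getloadavg()  # raises OSError on Windows

if load_1min > cores:
    print("Possible CPU bottleneck: load %.2f on %d cores" % (load_1min, cores))
else:
    print("CPU load looks healthy: %.2f on %d cores" % (load_1min, cores))
```

In practice you would feed a metric like this into a monitoring system and alert on a sustained breach rather than a single reading.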

Memory limitations: The next common bottleneck experienced is an insufficient amount of RAM. RAM can be thought of as a container into which more and more data is poured. Once the container is full, you need another one if you want to deal with any more data.

Memory limitations have been made more apparent by the sudden rise of Spark from Hadoop’s shadow. It utilizes in-memory processing to achieve several times the speed that Hadoop offers, but if you are low on RAM, your application will crash.

On the other hand, high RAM usage might also be an indication of badly-written code. You might have to dig into the source, find the memory leak and plug it.
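For that digging, Python’s standard library ships `tracemalloc`, which attributes allocations to source lines. The unbounded cache below is a deliberately contrived stand-in for a real leak:

```python
import tracemalloc

tracemalloc.start()

# Hypothetical leak: a cache that grows without bound.
leaky_cache = []
for i in range(10_000):
    leaky_cache.append("record-%d" % i)

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[0]

# The top entry points at the source line allocating the most memory --
# a good first suspect when usage keeps climbing.
print(top)
tracemalloc.stop()
```

Taking two snapshots some time apart and diffing them (`snapshot.compare_to`) is usually more telling than a single reading, since a true leak keeps growing between them.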

High Disk Usage: High disk usage essentially means running out of storage to keep all your data in. When relying on on-premises data stores, this will require physically adding more HDDs or SSDs.
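Catching this before the disk actually fills is straightforward with the standard library; the 90% threshold below is an arbitrary illustration:

```python
import shutil

# Check how full the root filesystem is; alert well before 100%.
usage = shutil.disk_usage("/")
percent_used = usage.used / usage.total * 100

if percent_used > 90:  # arbitrary threshold for illustration
    print("High disk usage: %.1f%% -- time to add capacity" % percent_used)
else:
    print("Disk usage at %.1f%%" % percent_used)
```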


To scale up or to scale out?

Once the problem has been diagnosed, it’s time to either scale up or scale out. Scaling up, more formally referred to as vertical scaling, involves improving your server’s hardware, while scaling out involves using more than one machine. It’s essentially a case of distributed computing versus shared-memory processing.

To borrow a helpful analogy, scaling out can be thought of as ‘a thousand little minions’ helping you to do the work while scaling up is ‘having one giant hulk’ to do the heavy lifting.

Scaling up may involve adding more memory, storage or computational power when the need arises. Modern cloud providers such as DigitalOcean, Linode and Google make this fairly easy: you just select the amount of additional processing power you need and the rest is done for you automatically in the backend. Vertical scaling may also involve software in more nuanced applications, such as adding more threads, more connections or larger cache sizes when dealing with big data.

It is advantageous because it uses less network hardware, so you won’t have to deal with issues such as bandwidth limitations, and it consumes far less power in the long run. For most organizations, however, these benefits might turn out to be only short-term, especially as the company continues to grow.

Scaling out is a term commonly used in reference to distributed architectures, and can be approached in two ways: using a third-party service that’s already distributed, or using additional servers to distribute the workload. At this point, vertical scaling can be achieved separately on each server, which can be incredibly advantageous for workloads that grow unpredictably, as a single node can be added or removed as required.

The downside is that achieving such scalability requires a lot of management, skill and a lot more maintenance. Without sufficient manpower, even setting up and handling three or four Hadoop clusters is going to be too much work. At the same time, do keep in mind that such software has a high barrier to entry: skilled manpower isn’t easy to come across.

The size of the company, its requirements and the resources it has at hand should all play a role in deciding whether to scale at all and, if so, when.