Over the last decade, an entire ecosystem of technologies has emerged to meet the business demand for processing an unprecedented amount of consumer data.
Organizations that narrowly focus their attention on operational data are putting their businesses at serious risk.
How much data exists?
Every passing moment, the pace of data creation continues to compound. In the time it takes you to read these few paragraphs, more than 200 million emails will be sent, millions of dollars in e-commerce will be transacted, 50 hours of YouTube videos will be uploaded, millions of Google searches will be launched, tens of millions of photos will be shared, and Facebook likes will outnumber cars in the U.S. Every few minutes, this cycle repeats and grows.
Ninety percent of the world’s digital data was created in the past two years alone. In 2016, the gigabyte equivalent of all movies ever made crossed the internet every 100 seconds.
Gigabytes are no longer a meaningful measure of volume. Nor are terabytes. Data volumes are rapidly moving past petabytes to zettabytes, and the vast majority of this data exists outside companies’ transaction and research systems.
Your firm will wither away if you ignore your data
Getting the right data to the right people at the right time, whether internal or external, is the name of the game in today’s demanding marketplace.
Every company has to find a way to harness big data and use it to drive growth. If your organization isn’t talking big data, you are in the minority – and that puts you at a competitive disadvantage.
Educate yourself about the meaning of big data and essential basics through this 30,000-foot view of the evolution of big data and key components of the data discussion.
With this foundation, you can proceed to the next step – addressing what to do with your data and how.
What is big data?
Big data happens when there is more input than can be processed using current data management systems.
The arrival of smartphones and tablets was the tipping point that led to big data. With the internet as the catalyst, data creation exploded with the ability to have music, documents, books, movies, conversations, images, text messages, announcements, and alerts readily accessible.
Digital channels (websites, applications, social media) exist to entertain, inform, and add convenience to our lives. But their role goes beyond serving the consumer audience: they also accumulate invaluable data that informs business strategies.
Digital technology that logs, aggregates, and integrates with open data sources enables organizations to get the most out of their data, and methodically improve bottom lines.
The development of modern data architecture
Until recently, businesses relied on basic technologies from select vendors. In the 1980s, Windows and the Mac OS debuted with integrated data management technology, and early versions of relational database engines began to become commercially viable.
Linux came onto the scene in 1991 as a free operating system kernel, paving the way for big data management.
Distributed file systems
In the early 2000s, Google proposed the Google File System, a technology for indexing and managing mounting data. A key tenet of the idea was using many low-cost machines to accomplish big tasks more efficiently and inexpensively than hardware on a central server.
Before the Information Age, data was transactional and structured. Today’s data is assorted, requiring a file system that is equipped to ingest and sort massive influxes of unstructured data. Open source and commercial software tools automate the necessary actions to enable the new varieties of data and its attendant metadata to be readily available for analysis.
Inspired by the promise of distributing the processing load for the increasing volumes of data, Doug Cutting and Mike Cafarella created Hadoop in 2005.
The Apache Software Foundation took the value of data to the next level with the release of Hadoop 1.0 in December 2011. Today this open source software technology is packaged with services and support from new vendors to manage companies’ most valuable asset – data.
The Hadoop architecture relies on distributing workloads across numerous low-cost commodity servers. Each of these “pizza boxes” (so called because they are an inch high and less than 20 inches wide and deep) has a CPU, memory and disk storage. They are simple servers with the ability to process immense amounts of various, unstructured data when running as nodes in a Hadoop cluster.
A more powerful machine called the name node manages the distribution of incoming data across the nodes. Data is, by default, written to at least three nodes and might not exist in its entirety as a single file in any one node. See the simple diagram above, which illustrates the Hadoop architecture at work.
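The replication behavior described above can be illustrated with a small sketch. It is not Hadoop code: the round-robin placement, the node names, and the tiny block size are assumptions chosen for readability, and a real cluster applies its own placement policy. It only shows how a file split into blocks ends up on three nodes per block, so that no single node necessarily holds the whole file.

```python
REPLICATION_FACTOR = 3  # the default number of copies mentioned in the text


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str],
                 replication: int = REPLICATION_FACTOR) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is a simplification assumed for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive offsets modulo the node count are always distinct
        # as long as replication <= len(nodes).
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical five-node cluster and a 1,000-byte "file".
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_blocks(len(blocks), nodes)

for block_id, holders in placement.items():
    print(f"block {block_id}: {holders}")
```

Losing any one node in this toy cluster still leaves two copies of every block, which is the redundancy the architecture relies on.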
Open source software
Open source software (OSS) is now used in the majority of enterprises. From operating systems to utilities to data management software, OSS has become the standard fare for corporate software development groups.
Serving as a progressive OSS organization, Apache Software Foundation is a non-profit group of thousands of volunteers who contribute their time and skills to building useful software tools.
As the project’s steward, the Apache Software Foundation continuously works to enhance the Hadoop code, including its distributed file system, called Hadoop Distributed File System (HDFS), as well as the code distribution and execution features known as MapReduce.
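For readers unfamiliar with MapReduce, the idea can be sketched in a few lines of plain Python. This is a single-machine illustration of the map, shuffle, and reduce steps using the classic word-count task. It does not use Hadoop's API, and the function names are invented for this example.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    # On a cluster, each document (or block) would be mapped on a
    # different node; here the three steps simply run in sequence.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big clusters"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread across many inexpensive machines.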
Within the past few years, Apache released nearly 50 related software systems and components for the Hadoop ecosystem. Several of these systems have counterparts in the commercial software industry.
Vendors have packaged Apache’s Hadoop with user interfaces and extensions, while offering enterprise-class support for a service fee. In this segment of the OSS industry, Cloudera, Hortonworks, and Pivotal are leading firms serving big data environments.
These software systems are now so tightly coupled to the core Hadoop environment that no commercial vendor has attempted to assimilate their functionality. The range of OSS systems, tools, products, and extensions to Hadoop includes capabilities to import, query, secure, schedule, manage, and analyze data from various sources.
Corporate NAS and SAN technologies, cloud storage, and on-demand programmatic requests returning JSON, XML, or other structures are often secure repositories of ancillary data. The same applies to public datasets — freely available datasets, in many cases for economic activity by industry classification, weather, demographics, location data, and thousands more topics. Data of this measure demands storage.
Distributed file systems greatly reduce storage costs while providing redundancy and high availability. Each node has its own local storage. These drives do not need to be fast, nor do they need to be solid-state drives (SSDs).
They are inexpensive, high-capacity pedestrian drives. Upon ingestion, each file is written to three drives by default. Hadoop’s management tools and the name node monitor each node’s activity and health so that poorly performing nodes can be bypassed or taken out of the distributed file system index for maintenance.
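One common way to monitor node health is a heartbeat: each node reports in periodically, and a node that stays silent too long is skipped. The sketch below illustrates that general technique rather than Hadoop's actual implementation. The class name, the 30-second timeout, and the node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeMonitor:
    """Toy health tracker; not Hadoop's real monitoring code."""
    timeout_seconds: float = 30.0  # hypothetical threshold for this sketch
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now` (in seconds)."""
        self.last_seen[node] = now

    def healthy_nodes(self, now: float) -> list[str]:
        """Return nodes heard from within the timeout, sorted by name."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


monitor = NodeMonitor(timeout_seconds=30.0)
monitor.heartbeat("node1", now=0.0)
monitor.heartbeat("node2", now=0.0)
monitor.heartbeat("node1", now=50.0)   # node2 has gone quiet

# At t=60, only node1 has reported within the last 30 seconds.
print(monitor.healthy_nodes(now=60.0))
```

Passing the clock value in explicitly keeps the example deterministic. A real monitor would read the system clock and would also re-replicate blocks held by a node that has been bypassed.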
The term “data lake” describes the vast storage of data from music, documents, books, movies, conversations, images, text messages, announcements, and alerts. These vastly different data sources are in a minimum of a dozen different file formats.
Some are compressed or zipped. Some have associated machine data, as found in photos taken with any phone or digital camera. The date, camera settings, and often the location are available for analysis. For example, a query to the lake for text messages that included an image taken between 9 p.m. and 2 a.m. on Friday or Saturday nights in Orlando on an iPhone would probably show fireworks at Disney World in at least 25% of the images.
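The Orlando query above can be sketched as a filter over image metadata. The record layout (`name`, `city`, `device`, `taken_at`) and the sample rows are invented for illustration. A real data lake would expose this metadata through whatever query engine sits on top of it. The one subtle point is that a photo taken at 1 a.m. on Sunday belongs to Saturday night.

```python
from datetime import datetime, timedelta

FRIDAY, SATURDAY = 4, 5  # datetime.weekday() numbering, Monday = 0


def night_of(taken_at: datetime):
    """Weekday of the night a photo belongs to, or None outside 9 p.m.-2 a.m.

    The window is half-open: 21:00 is included, 02:00 is not.
    """
    if taken_at.hour >= 21:
        return taken_at.weekday()
    if taken_at.hour < 2:
        # After midnight still counts as the previous day's night.
        return (taken_at - timedelta(days=1)).weekday()
    return None


def matches(photo: dict) -> bool:
    """True for iPhone photos from Orlando on a Friday or Saturday night."""
    return (
        photo["city"] == "Orlando"
        and "iphone" in photo["device"].lower()
        and night_of(photo["taken_at"]) in (FRIDAY, SATURDAY)
    )


# Hypothetical metadata records; 1 July 2016 was a Friday.
photos = [
    {"name": "a.jpg", "city": "Orlando", "device": "iPhone 6",
     "taken_at": datetime(2016, 7, 1, 21, 30)},
    {"name": "b.jpg", "city": "Orlando", "device": "iPhone 6",
     "taken_at": datetime(2016, 7, 3, 1, 30)},   # early Sunday = Saturday night
    {"name": "c.jpg", "city": "Orlando", "device": "Galaxy S7",
     "taken_at": datetime(2016, 7, 2, 22, 0)},
    {"name": "d.jpg", "city": "Tampa", "device": "iPhone 6",
     "taken_at": datetime(2016, 7, 2, 22, 0)},
    {"name": "e.jpg", "city": "Orlando", "device": "iPhone 6",
     "taken_at": datetime(2016, 7, 4, 22, 0)},   # Monday night
]

selected = [photo["name"] for photo in photos if matches(photo)]
print(selected)
```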
Enterprise administration of applications, their storage requirements, security granularity, compliance, and dependencies required Hadoop distributions (like those from Cloudera and Hortonworks) to mature these capabilities in the course of becoming a managed service to an enterprise.
On the previous page, you’ll see a view of Hadoop’s place among other software ecosystems. Note that popular analysis tools such as Excel and Tableau, databases such as SQL Server and Oracle, and development platforms such as Java and Informatica Data Quality are understood to be valuable tools in developing big data solutions. Administration through Cisco and HP tools is common.
Commercial software companies such as IBM, Microsoft, Informatica, SAP, Tableau, Experian, and other established vendors have begun connecting to Hadoop, offering functionality such as data integration, quality assessment, context management, visualization, and analysis.
Analytics are the endgame for developing a big data environment. The rise of big data has given credence to a new resource classification, the data scientist — a person who embodies an analyst, technologist, and statistician all in one.
Using several approaches, a data scientist might perform exploratory queries using Spark or Impala, or might use a language such as R or Python. As a free language, R is rapidly growing in popularity. It is approachable by anyone who is comfortable with macro languages such as those found in Excel. R and its libraries implement statistical and graphical techniques.
Moving to the cloud
Cloud computing refers to server-class hardware and software (cloud storage, multi-tenant shared hosts, managed virtual servers) that is not housed on a company’s premises. In cloud environments, an organization does not own the equipment, nor does it employ the network and security technologists who manage the systems.
Cloud computing provides a hosted experience, where services are fully remote and accessed with a browser.
The investment to build a 10- or 20-node HDFS cluster in the cloud is relatively small compared to the cost of implementing a large-scale server cluster with conventional technologies. The initial build-out of redundant data centers by Amazon, Microsoft, Google, IBM, Rackspace, and others is complete. The experience gained along the way has made systems available at prices below the cost of a single technician. Today, cloud computing fees change rapidly, with pricing measured by various usage patterns.
Conclusion: The rise of big data is evident
Big data is not a fad or soon-to-fade trending hashtag. Since you began reading this paper, 100 million photos have been created, with a sizeable portion having a first-degree relationship to your industry. And the pace of data creation continues to increase.
The distribution of computing processes is a necessity for enterprises to gain a 360-degree view of their customers through big data collection and analysis.
Big data technologies are becoming an industry standard in finance, commerce, insurance, healthcare, and distribution. Companies that embrace data solutions establish a self-service business-driven paradigm while improving management and operational processes, creating a competitive advantage that will withstand an ever-evolving marketplace.
Don’t be the organization that continues to process only operational data. Embrace the big data technologies and solutions that are now surging, and use the results to optimize your products for continued growth.