Channel: data analytics - EnterpriseAI

Data Lakes and Overcoming the Waste of ‘Data Janitor’ Duties


Data lakes solve a lot of problems in today's big data world. When properly designed, they serve as an efficient means for storing large volumes and varieties of data. A data lake's efficiency comes from its inverse approach to data storage as compared to traditional structured data warehouses: Rather than enforcing rigid schemas upon arrival, data lakes allow data to exist, unconstrained, in native format.

This approach, commonly referred to as 'schema-on-read,' defers the time-consuming task of data modeling until the enterprise has a clear idea of what questions it would like to ask of the data. In light of the influx of new data sources that enterprises must contend with — many of them unstructured and many whose value is not yet fully understood — this approach not only promotes an agile response to new data sources, it also helps ensure that future analysis efforts aren't limited by schema decisions made earlier.
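
To make schema-on-read concrete, here is a minimal sketch using PySpark (an assumption on my part, since the article names no specific tooling; the lake path and column names are hypothetical). The raw JSON files stay in the lake untouched, and a schema is declared only at query time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw events were landed in the lake as-is; nothing was modeled at write time.
# The schema below is declared only now, when we know the question we want to ask.
clickstream_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("event",   StringType(), True),
    StructField("value",   DoubleType(), True),
    StructField("ts",      TimestampType(), True),
])

events = (spark.read
          .schema(clickstream_schema)                    # schema applied on read, not on write
          .json("s3://example-lake/raw/clickstream/"))   # hypothetical lake path

# Ask today's question without constraining tomorrow's.
events.groupBy("event").count().show()
```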

So that's the good news: data lakes hold the promise of a fundamentally better way of tackling the demands of big data. And we have the technology and processing power to do it in the cloud, allowing for the massive scale of both storage and computing that today's big data workloads demand.

Sounds great, but…

Building a data lake and pouring data into it is the easy part. The hard part is managing that data so that it's useful. Without schemas or relational databases to provide context and consistency, the real challenge enterprises face is finding other ways to link and correlate the widely disparate data types stored in their data lakes — or risk them becoming a collection of siloed “data puddles.” In fact, it's been estimated that data scientists spend up to 80 percent of their time as “data janitors,” cleaning up raw data in preparation for analysis.[1] Those highly trained experts should be focusing on analysis and insights, not merging and de-duplicating records.
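
As an illustration of the kind of 'janitorial' work described above (my own example; the column names and matching rule are hypothetical), the pandas sketch below normalizes, de-duplicates and merges two customer extracts before any analysis can begin:

```python
import pandas as pd

# Two raw extracts of the same customers, each with its own quirks.
crm = pd.DataFrame({
    "email": ["Ann@Example.com", "bob@example.com ", "bob@example.com"],
    "name":  ["Ann Lee", "Bob Roy", "Bob Roy"],
})
web = pd.DataFrame({
    "email": ["ann@example.com", "carol@example.com"],
    "last_visit": ["2016-09-01", "2016-09-03"],
})

def clean(df):
    # Normalize the join key so 'Ann@Example.com' and 'ann@example.com' match.
    df = df.copy()
    df["email"] = df["email"].str.strip().str.lower()
    return df.drop_duplicates(subset="email")

crm, web = clean(crm), clean(web)

# Merge the de-duplicated records into one analysis-ready table.
customers = crm.merge(web, on="email", how="outer")
print(customers)
```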

Not to mention, it’s almost impossible for an enterprise to reliably predict its future analytics needs. That means a data lake designed for today’s storage and analytics needs may be completely ill-equipped to handle tomorrow’s. To integrate new data types and solutions, the enterprise would need to constantly reconfigure the lake, consuming precious time and resources with each new integration. Enterprises need a solution now — and they need it fast.

Why the urgency?

Because readying data for analysis is the most time-consuming part of any big data initiative — and certainly the most challenging part of analyzing data stored in data lakes — the whole process requires an entirely new approach. Data has become a key competitive differentiator for companies. For some it’s even more valuable than their core product. Take Facebook, whose user data is orders of magnitude more valuable than its newsfeed. Uber and Airbnb have recognized the tremendous value of their data about users’ travel habits, which could far exceed the value of their actual services.

Why is this so critical right now? Four reasons:

  • We’re facing a continued dramatic escalation in the volume and variety of data inflow, which legacy systems are unprepared to handle.
  • Enterprises can’t predict their future data needs, but do know they’ll need to be able to react even faster than they do now. Current systems already can’t keep up — they need far greater agility.
  • Conventional data lakes that depend on relational databases are simply too clunky. As new business questions arise or new systems are brought to bear — layering on a graph database or a search engine to investigate a complex business question, for example — we need a solution that can create just-in-time data pools, grouping specialized data sets within the larger lake without full extraction, something legacy systems cannot do.
  • The lines between data integration and management are blurring. This should be a symbiotic process, for which conventional data lake environments are not equipped. It calls for a solution that marries the two, allowing them to work in harmony.


dPaaS: A Future-Ready Approach to Big Data

dPaaS, or Data Platform as a Service, has emerged as a much more agile solution that unifies the closely-related operations of integration, data management and data storage (i.e. data lake hosting). This new, data-centric approach is critical for creating a foundation that can accommodate all types of data formats and analytics applications, now and in the future.

Other solutions rely on point-to-point, hard-coded connections to link the disparate databases that make up a data lake. With dPaaS, by contrast, all iterations of the data are persisted in the central repository (i.e., the data lake) as it is ingested and integrated, along with robust metadata that organically grows the lake. Data management functions such as cleansing, deduplication, and match and merge also take place at this stage. With metadata attached, the data can be organized on the way in, moving from its raw format in the lake to specialized data stores. Investigators can then comb through and transform the data for the queries needed right now without modifying the core data, leaving it free to be transformed again, in a different way, to answer the next question.
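
The article describes this pattern only at a conceptual level; the following is a minimal sketch under my own assumptions (an in-memory dict stands in for the central repository, and the field names are invented). Raw records are persisted unchanged with provenance metadata attached at ingest, and query-specific views are derived from, never written over, the raw copies:

```python
import hashlib, json, datetime

lake = {"raw": {}, "views": {}}   # toy stand-in for the central data repository

def ingest(record, source):
    """Persist the record as-is and attach metadata describing its provenance."""
    key = hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()
    lake["raw"][key] = {
        "payload": record,                                   # immutable raw copy
        "metadata": {"source": source,
                     "ingested_at": datetime.datetime.utcnow().isoformat()},
    }
    return key

def derive_view(name, transform):
    """Build a query-specific view; the raw payloads are read, never modified."""
    lake["views"][name] = [transform(e["payload"]) for e in lake["raw"].values()]

ingest({"sku": "A1", "qty": "3"}, source="pos_feed")
ingest({"sku": "A1", "qty": "5"}, source="ecommerce")

# Today's question needs integer quantities; tomorrow's view can slice differently.
derive_view("units_sold", lambda r: {"sku": r["sku"], "qty": int(r["qty"])})
print(lake["views"]["units_sold"])
```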

When the dPaaS solution is delivered with a microservices approach, it enables even greater analytical flexibility and power, improving the usefulness of existing data along with the stability of the entire system. Unlike monolithic, packaged software that offers 100 features when you really only need five, microservices solutions allow enterprises to customize their data lake utilization processes, adding only the functionality they need, when they need it, and easily removing functions they no longer require. Even better, this can be done on demand, with no complex integration required. That means instead of modifying their business processes to fit a vendor’s application, enterprises can keep the process and use the microservices they need to get the job done.

For even greater efficiency, these integration and data management functions can be offered as managed services, which frees data scientists and other experts at the organization to focus on the end goal: analysis. It also puts the onus on the vendor offering the managed service to handle security, compliance and maintenance, eliminating the risks inherent in the likes of iPaaS solutions where self-service integration makes it harder to control governance, introducing potential security risks.

dPaaS Lets Enterprises Get to Work

With a cleansed and enriched lake of data now at the ready — and time available for data scientists to devise the queries — data analysis can be performed ad hoc using microservices and configuration-based systems. Schemas and their outputs are modeled on the fly, providing rapid movement of data from the data lake into formats appropriate for the task at hand: a graph database one week, a time series database the next, or a relational or key/value store the week after that.
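
To illustrate modeling schemas on the fly, here is a toy sketch (my own construction, not a feature of any particular dPaaS product; the sensor data is invented) that projects the same cleansed events into a key/value shape for one question and a time-series shape for the next, leaving the lake copy untouched:

```python
from collections import defaultdict

events = [   # cleansed records already sitting in the lake
    {"ts": "2016-10-03T10:00", "sensor": "s1", "temp": 21.4},
    {"ts": "2016-10-03T11:00", "sensor": "s1", "temp": 22.0},
    {"ts": "2016-10-03T10:00", "sensor": "s2", "temp": 19.8},
]

# This week's question: latest reading per sensor -> key/value layout.
kv_store = {}
for e in sorted(events, key=lambda e: e["ts"]):
    kv_store[e["sensor"]] = e["temp"]

# Next week's question: readings over time per sensor -> time-series layout.
ts_store = defaultdict(list)
for e in events:
    ts_store[e["sensor"]].append((e["ts"], e["temp"]))

print(kv_store)
print(dict(ts_store))
```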

Exciting new technological advances such as non-relational databases, distributed computing frameworks, and, yes, data lakes are all getting us closer to a data-inspired future. But, there's still no substitute for good old-fashioned data management. In fact, the need for diligent data governance is more important than ever, and the dPaaS approach to data operations makes sure this critical piece of the puzzle is not overlooked.

With dPaaS, the real promise of Big Data is within reach, giving enterprises the ability to actually use their data for maximum impact and competitive advantage.

[1] The New York Times, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," August 2014

Brad Anderson is vice president of big data informatics at Liaison Technologies.


Bullish Hadoop Forecast Despite Spark Hype


Despite major market inroads being made by Apache Spark, a new forecast estimates the global market for the Hadoop big data framework will continue to grow at a healthy clip through 2021, fueled in part by growing enterprise demand for Hadoop services.

According to a market forecast released this week by Allied Market Research, the Hadoop market is expected to grow at a 63.4 percent compound annual growth rate over the next five years, reaching $84.6 billion by 2021. The sustained growth is attributed in part to accelerated Hadoop adoption in Europe, where annual growth rates are expected to top 65 percent.
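
As a back-of-the-envelope check on those forecast figures (my own arithmetic, assuming the 63.4 percent rate compounds annually over the five years ending in 2021), the implied base-year market size works out to roughly $7 billion:

```python
# Back-of-the-envelope check of the Allied Market Research figures
# (assumption: 63.4% CAGR compounds annually over five years, 2016-2021).
cagr, years, value_2021 = 0.634, 5, 84.6  # market value in billions of dollars

implied_2016_base = value_2021 / (1 + cagr) ** years
print(f"Implied 2016 market size: ${implied_2016_base:.1f}B")   # roughly $7.3B

# Equivalent forward projection from that base:
projection = implied_2016_base * (1 + cagr) ** years
print(f"Projected 2021 market size: ${projection:.1f}B")        # back to $84.6B
```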

The sustained growth of the Hadoop market stems largely from higher rates of adoption in North America, especially in the IT, banking and government sectors as enterprise big data strategies have been rolled out.

The forecast includes Hadoop software, hardware and services. The market watcher found that Hadoop services accounted for nearly half (49 percent) of global demand. Hadoop services include consulting, "integration and deployment" along with middleware and support. Consulting and related services alone accounted for an estimated $1.6 billion in revenue last year, the market researcher said.

Meanwhile, increasing investments in big data analytics and "real-time operations" are expected to drive integration and deployment services, making it the fastest growing Hadoop services segment (64.8 percent) over the next five years. The "trade and transportation" sectors are expected to fuel adoption of Hadoop services during the forecast period, with an annual growth rate pegged at 76.3 percent through 2021.

Along with trade and transportation, other key end-users of Hadoop products and services include a banking, financial services and insurance category along with manufacturing, retail, telecommunications, healthcare and media and entertainment. "Factors such as aggrandized generation of structured and unstructured data and efficient and affordable data processing services offered by Hadoop technology are the major drivers of the market," the analyst concluded.

Along with the rise of Spark, other potential challenges to continued growth of Hadoop adoption include distributed computing and security issues, researchers added.

Among the Hadoop market leaders identified in the forecast are Hortonworks Inc., Cloudera Inc. and MarkLogic, which "have focused on development of advanced, Hadoop-based data storage, management, and analytics solutions to cater to the customized requirements of business enterprises."

The Hadoop market also may face headwinds as the much-hyped Spark platform matures. Billed as the next-generation data processing engine, Spark 2.0 is expected to offer a new structured streaming approach that could help unify development of big data batch and streaming applications.

Another knock on Hadoop, which was launched as a batch processing system for search data, is that it was not designed for real-time, interactive analytics and reporting. Development work on new processing engines to supplement or replace Hadoop's core MapReduce engine has met with mixed results, critics contend.

IBM Project DataWorks: Joining Multi-Sourced Data for AI-based Analytics


IBM’s aggressive push into the data analytics market continued today with the announcement of Project DataWorks, a Watson initiative that IBM said is the first cloud-based data and analytics platform to integrate all types of data and enable AI-powered decision-making.

Project DataWorks is designed to make it simple for business managers and data professionals to collect, organize, govern, secure and generate insight from multi-sourced, multi-format data. The goal: become what IBM calls “a cognitive business.” Project DataWorks deploys data products on the IBM Cloud using machine learning and Apache Spark while ingesting data at rates from 50 to hundreds of Gbps from a variety of endpoints: enterprise databases, Internet of Things, streaming, weather, and social media.

“It’s a system that will on-board data, tools, users, apps, all in a scalable and governed way,” Rob Thomas, VP of Products, IBM Analytics, told EnterpriseTech. “The purpose is simple: we are preparing all data within a company for use by AI. We’re helping people leap into the future around AI and machine learning.”

Project DataWorks is intended to overcome much of the complexity involved in implementing big data analytics. Most of the work involved in large scale analytics projects is done by data professionals working in silos with disconnected tools and data services that may be difficult to manage, integrate, and govern. IBM said Project DataWorks helps break down barriers by connecting multi-format data. Data professionals can work together on an integrated, self-service platform, sharing common datasets and models for better governance, while iterating data projects and products with less time spent on finding and preparing data for analysis.

IBM's Rob Thomas

Available on Bluemix, IBM’s Cloud platform, Project DataWorks is “built entirely on Open Source,” Thomas said, “so clients can have access to all the innovation that’s in the Open Source community, but not deal with the headaches of trying to integrate those pieces.” IBM said Project DataWorks leverages an open ecosystem of more than 20 partners and technologies, such as Confluent, Continuum Analytics, Galvanize, Alation, NumFOCUS, RStudio and Skymind.

IBM also announced a list of customers using Project DataWorks, including Dimagi, KollaCode LLC, nViso, Quetzal, RSG Media, Runkeeper, SeniorAdvisor.com and TabTor Math.

RSG Media, which delivers analytical software and services to media and entertainment companies, uses Project DataWorks to perform analytics across large volumes of first- and third-party data sets. These include monitoring cross-platform content and advertising viewership, and identifying individual viewing behaviors while cross-analyzing demographic, lifestyle and social insights. RSG Media helps clients gain insights on audience preferences and develop programming schedules. According to the company, in one scenario this resulted in a lift of $50 million to a single network’s bottom line.

“We realized that we needed more than just a cloud infrastructure provider. We needed a partner to help us manage data on an unprecedented scale, and empower our clients to turn that data into insight,” said Mukesh Sehgal, founder and chief executive officer, RSG Media. “IBM is the only cloud vendor who offers an integrated set of capabilities for building advanced analytics applications that would allow us to quickly and cost-effectively bring new offerings to market."

IBM also announced the DataFirst Method, a methodology that helps organizations assess their skills and build a roadmap for progressing in their use of data, including practices and methods to help clients transform their processes for data discovery, handling and analytics.

The ‘Insight-Driven Business’: How to Become a Master of the Data Universe


Masters of the Data Universe: Uber, Netflix, Facebook, Amazon, Google. We know who they are. They hire genius data scientists who wrestle data into submission, building advanced analytics superstructures that light up their data and reveal insights about their markets while nurturing interactive customer relationships that leave everyone else shaking their heads, wondering how they do it.

It can seem like magic. As Forrester Research’s Brian Hopkins, vice president and principal analyst, observes, there’s the vague notion (encouraged by some vendors) that “somehow data goes into a box with the elephant and good things come out of the box, magic happens.”

But now, several years into the Big Data analytics revolution, patterns and common characteristics are emerging among the Masters of the Data Universe. According to Hopkins, they have landed on the critical issue: they view analytics through a strategic prism that distinguishes data from action.

Put another way: there’s a wide gulf between Big Data and actual analytics - much more of the former is going on than the latter. Most companies have built the initial Big Data framework that could be used in an analytics implementation. “We aren’t lacking for investment in new big data technologies,” Hopkins said at last week’s Strata + Hadoop conference in New York. “In fact, this year at Strata I assume every company has built a data lake. You probably have Hadoop.”

So while the enterprise landscape is a data lakes district, the lakes themselves are dead pools for generating the kind of market and customer insights that companies crave. Forrester ran a survey last year in which nearly three-quarters of data architects said they aspire to be data driven. “In the same survey, I asked how good are you at actually taking the results of analytics, the insights, and creating actions that matter to your firm? Far fewer people say they can do that well,” Hopkins said. The old saw that we’re drowning in data and thirsting for insight still holds true.

Hopkins outlined a strategic framework for moving toward transformative analytics, one based on his study of the Masters of the Data Universe and their shared traits for becoming an “Insight-Driven Business.” Attaining this status, he said, is both achievable and mandatory: failure to figure out Big Data analytics will mean significant competitive disadvantage and eventual decline.

The first step is to embrace a concept Hopkins calls “digital insight,” the ability “to systematically harness and apply digital insights to create a sustainable competitive advantage.”

“By that I mean new actionable knowledge. It’s data, but it’s data that leads to action in the context of a process or a decision,” he said. The transformative aspect of “digital insight” is that new knowledge becomes embedded in software, getting it into the guts of the analytics system and serving as an insight engine that knows no rest.

This “refocuses our conversation from ‘How do I get insight (that’s) in the data?’ to ‘How do I implement insight – no matter where it comes from – in software?’” Hopkins said. “So it’s a marriage of the insight execution in application development.”

The second step is to understand how Insight-Driven businesses operate. Hopkins said he has interviewed hundreds of “digital disrupters,” ranging from the monsters (Netflix, et al.) to start-ups, “trying to figure out what it is they’re doing differently.”

“It’s not the technology or what they did with data that’s so interesting,” he said, “it’s the fact that they understand how to apply insights in software to drive competitive advantage in ways that many of us don’t.”

Common among them is the circular and continual use/re-use of data within a closed loop system. “The pattern appears over and over again. They’re operating in this closed loop and they’re operating faster in a closed loop than their competition.” This “system of insight,” he said, is going to the right data, combining the right data and creating effective actions using a process Hopkins calls Insight to Execution.

The idea of “circular insights” is not new. What’s different about Insight to Execution, according to Hopkins, is “continuous experimentation, testing and learning with your insights. So (is it) having an insight, deploying a predictive model, updating that predictive model.” Updates aren’t happening every six months, either; they’re happening continually. Insight-Driven businesses “are always hypothesizing.”

Hopkins summarizes the Insights to Execution loop as:


  1. Experiment and learn continuously – question every process and decision.
  2. Identify outcomes and interim metrics – develop metrics for every outcome; instrument and measure processes, decisions and outcomes.
  3. Gather more data – Start with data you have, then add new sources and kinds of data as you learn.
  4. Develop insights – apply analytics and AI methods to develop potential insights.
  5. Test and implement insights in software – run insights and experiments in software, processes and decisions.
  6. Measure results and refine insights – courageously assess and share the results. Did what you expected actually happen?
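
A minimal sketch of that loop in code, under assumptions of my own (the data, the "model" and the metric are invented for illustration): an insight is deployed as a simple threshold model, results are measured on fresh data, and the model is refined on every pass rather than every six months.

```python
import random

def gather_data(n=200):
    """Step 3: each pass pulls fresh (feature, outcome) pairs."""
    data = []
    for _ in range(n):
        x = random.random()
        data.append((x, 1 if x > 0.6 else 0))
    return data

def develop_insight(data):
    """Step 4: a deliberately simple 'model' - a threshold fitted to the data."""
    positives = [x for x, y in data if y == 1]
    return min(positives) if positives else 0.5

def measure(threshold, data):
    """Step 6: assess the result, ugly or not."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)

threshold = 0.5                       # initial hypothesis
for iteration in range(3):            # the loop runs continually, not every six months
    data = gather_data()                       # gather more data
    accuracy = measure(threshold, data)        # measure results
    threshold = develop_insight(data)          # refine the insight for the next pass
    print(f"pass {iteration}: accuracy={accuracy:.2f}, new threshold={threshold:.2f}")
```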

Hopkins cited Stitch Fix, a custom tailoring online clothing company. One way Stitch Fix is an Insight-Driven Business is that it continually enriches its data by running experiments to gain greater customer understanding – to the point of sending clothing to people who they believe are likely to return items “just to learn stuff about the people who will keep the clothing, so they can make their models better. They’re running experiments. They’re experimenting and learning continuously as they go around the loop.”

A key element to this experimentation, he said, is “understanding the outcomes you want to change and getting granular about the level of detail so you can measure an instance.”

Brian Hopkins of Forrester Research

Hopkins said Stitch Fix followed the wisdom of starting a 1,000-mile journey with a single step: they began with the data that they had, and worked their way up from there. This runs counter to many companies that begin by amassing enormous amounts of (unwieldy) data. Stitch Fix doesn’t “have a whole lot of really exotic data; they’re merely optimizing the data in that ‘System of Insight’ in which they transact with people.”

“What a lot of these companies do is start that way, and as you go around the loop, then you add those secondary data sets,” Hopkins said. “But start with the data that you have, add more data over time as you find and drive those insights.”

Hopkins said the testing and implementing of insights in software “makes application developers as important as data scientists in this process.” At Stitch Fix, they’re the same person, he said, while at others, like Tesla, they’re different but work together as a team. Hopkins said he’s talking to Insight-Driven CEOs who tell him they put as much emphasis on hiring good software developers, who can embed insight into code, as they do good data scientists.

Hopkins said Stitch Fix’s chief algorithm officer (CAO) told him the company runs algorithms on Amazon S3, deploying them into the business applications used by employees to send clothing to customers, “feeding all that data back to the system in real time, back into S3, and round and round they go. It’s a pretty common pattern. You go to Uber, Netflix, they learned how to do this.”

He said they use Apache YARN to stand up Spark clusters. “They’re…standing up instances of data science, updating those algorithms, deploying them back into their applications, and it’s not just that they’re doing this, it’s how quickly they can do it. They can update their algorithms very fast.”
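
The following is a generic sketch of that pattern, not Stitch Fix’s actual pipeline (the bucket paths, column names and scoring formula are hypothetical): a Spark job reads recent interaction data from S3, applies whatever scoring rule the data science team last deployed, and writes the scores back for the business application to pick up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("closed-loop-scoring").getOrCreate()

# Read the latest interaction data that applications wrote back to S3.
interactions = spark.read.parquet("s3a://example-bucket/interactions/latest/")  # hypothetical path

# A placeholder logistic score stands in for whatever model was last deployed.
scored = interactions.withColumn(
    "keep_likelihood",
    1 / (1 + F.exp(-(0.8 * F.col("past_keep_rate") + 0.2 * F.col("style_match"))))
)

# Publish scores where the business application (and the next loop iteration) can find them.
scored.write.mode("overwrite").parquet("s3a://example-bucket/scores/latest/")
```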

The Stitch Fix CAO told Hopkins that the company’s analytics strategy has helped it understand “not only what their customers want to buy but how much to make and how much inventory to carry. So this way of working is not just a matter of engaging with the customer, it applies to the whole company front and back. They’ve dramatically reduced their inventory carrying costs as well as knowing their customers better than their competition. That’s the secret to success in today’s BI world.”

At Tesla, the advanced car manufacturer, the company has built what Hopkins calls an “insights fabric server” that serves as an analytics platform and gathering place for “all the different places that they keep data.”

“They say, ‘Look, there’s too much data to keep it all in Hadoop, we have to put it all over the place and bring into a platform.’” Inside the platform is a web-based UI that serves up insights on-demand to engineers that are “data-massaged with knowledge.” Using a set of data pipelines that feed the platform, data scientists and design engineers work together to update cars’ firmware.

“They’ve added horsepower, they’ve changed the elevation, they’ve changed the experience in real time, and what that does is create more data and goes then back through their pipelines into all their sources and that data then becomes available to their engineers to change the experience again.”

A final step: after implementing and testing insights in software, “you’ve got to courageously measure and share the results. And that’s not easy specifically because a lot of times those results will be ugly, and you won’t want to share them across the organization.” But at Uber, for example, “half the employees access their data warehouse every day. They share data,” Hopkins said. Sharing is critical to hypothesizing, learning and refining insights on an ongoing basis.

“This is what these companies do,” Hopkins said. “They go around this loop in hours and days, not weeks, months and years. That’s how they’re outpacing their competition.”

 

BSC Presents Plan to Energize Europe’s Big Data Efforts


Researchers from the Barcelona Supercomputing Center (BSC) today presented the big data roadmap commissioned by the EU as part of the RETHINK big project, which is intended to identify technology goals, obstacles and actions for developing a more effective big data infrastructure and competitive position for Europe over the next ten years. Not surprisingly, the leading position of non-European hyperscalers was duly noted as a major roadblock.

Paul Carpenter, senior researcher at BSC and a member of the RETHINK big team, presented the project’s results at the European Big Data Congress being held at BSC this week. The report, like so many EU technology efforts in recent years, is tightly focused on industry, developing a stronger technology supplier base as well as promoting the use of advanced scale technology by commercial end-users. A major objective is to bring together academic expertise with industry to move the needle.

To a significant extent, the key findings are a warning:

  • Europe is at a strong disadvantage with respect to hardware/software co-design. The European ecosystem is highly fragmented while media and internet giants such as Google, Amazon, Facebook, Twitter and Apple (also known as hyperscalers) are pursuing verticalization and designing their own infrastructures from the ground up. European companies that are not closely considering hardware and networking technologies as a means to cutting cost and offering better future services run the risk of falling further and further behind. Hyperscalers will continue to take risks and transform themselves because they are the “ecosystem,” moving everybody else along in their trail.
  • Dominance of non-European companies in the server market complicates the possibility of new European entrants in the area of specialized architectures.
  • Intel is currently the gatekeeper for new Data Center architectures; moreover, Intel is spearheading the effort to increase integration into the CPU package, which can only exacerbate this problem.
The report notes pointedly that even today big data is a capricious and arbitrary term – no one has adequately defined it, perhaps because it’s a moving target. What’s nevertheless clear is that the avalanche of data is real, and managing and mining it will be increasingly important throughout society. What’s not clear, at least to most European companies according to the report, is how much to spend on advanced technology infrastructure to capitalize on the opportunity.

Many European companies, according to the report, are skeptical of the ROI and remain “extremely price-sensitive” with regard to adopting advanced hardware. Coupling that wariness over ROI with “the fact that there is no clean metric or benchmark for side-by-side comparisons for heterogeneous architectures, the majority of the companies were not convinced that the investment in expensive hardware coupled with the person months required to make their products work with new hardware were worthwhile.”

One reason for the more laissez-faire attitude, suggested the report, is the industry does not yet see big data problems, only big data opportunities: “This is largely the case because the industry is not yet mature enough for most companies to be trying to do that kind of analytics and all-encompassing Big Data processing that leads to undesirable bottlenecks.”

Industry is also still focused on finding how to extract value from their data, according to the roadmap, and companies are also still looking for the right business model to turn this value into profit. Consequently, they are not focused on processing (and storage) bottlenecks, let alone on the underlying hardware.

The plan suggests a timetable for many objectives and identifies a “technology readiness level” (TRL) for each goal, as well as achievable elements. An example around networking technology is shown below.

Many of the report’s findings were gleaned from interviews and surveys with more than 100 companies across a broad spectrum of big data-related industries, including “major and up-and-coming players from telecommunications, hardware design and manufacturers as well as a strong representation from health, automotive, financial and analytics sectors.” A fair portion of the report also tackles the technology innovation needed to handle big data going forward.

The RETHINK big roadmap, said Carpenter, helps provide guidance on gaining leadership in big data through, “stimulation of research into new hardware architectures for application in artificial intelligence and machine learning, and encouraging hardware and software experts to work together for co-design of new technologies.”

The report tackles a long list of topics: disaggregation of the datacenter; the rise of heterogeneous computing with an emphasis on FPGAs; use of software defined approaches; high speed networking and network appliances; trends toward growing integration inside the compute node (System on a Chip versus System in a Package). Interestingly, quantum computing is deemed unready, but neuromorphic computing is thought to be on the cusp of readiness and represents an opportunity for Europe, which has supported active research through its Human Brain Project.

The RETHINK big roadmap bullets out 12 action items, which constitute a good summary:

  • Promote adoption of current and upcoming networking standards. Europe should accelerate the adoption of the current and upcoming standards (10 and 40Gb Ethernet) based on low-power consumption components proposed by European companies and connect these companies to end users and data-center operators so that they can demonstrate their value compared to the bigger players.
  • Prepare for the next generation of hardware and take advantage of the convergence of HPC and Big Data interests. In particular, Europe must take advantage of its strengths in HPC and embedded systems by encouraging dual-purpose products that bring these different communities together (e.g. HPC / Big Data hardware that can be differentiated in SW). This would allow new companies to sell to a bigger market and decrease the risk associated with development of new products.
  • Anticipate the changes in Data Center design for 400Gb Ethernet networks (and beyond). This includes paying special attention to photonics-on-silicon integration and novel Data Center interconnect designs.
  • Reduce risk and cost of using accelerators. Europe must lower the barrier to entry of heterogeneous systems and accelerators; collaborative projects should bring together end users, application providers and technology providers to demonstrate a significant (10x) increase in throughput per node on real analytics applications.
  • Encourage system co-design for new technologies. Europe must bring together end users, application providers, system integrators and technology providers to build balanced system architectures based on silicon-in-package integration of new technologies, I/O interfaces and memory interfaces, driven by the evolving needs of big data.
  • Improve programmability of FPGAs. Europe should fund research projects involving providers of tools, abstractions and high-level programming languages for FPGAs or other accelerators with the aim of demonstrating the effectiveness of this approach using real applications. Europe should also encourage a new entrant into the FPGA industry.
  • Pioneer markets for neuromorphic computing and increase collaboration. For neuromorphic computing and other disruptive technologies, the principal issue is the lack of a market ecosystem, with insufficient appetite for risk and few European companies with the size and clout to invest in such a risky direction. Europe should encourage collaborative research projects that bring together actors across the whole chain: end users, application providers and technology providers to demonstrate real value from neuromorphic computing in real applications.
  • Create a sustainable business environment including access to training data. Europe should address access to training data by encouraging the collection of open anonymized training data and encouraging the sharing of anonymized training data inside EC-funded projects. To address the lack of information sharing, Europe should encourage interaction between hardware providers and Big Data companies using the network-of-excellence instrument or similar.
  • Establish standard benchmarks. It is difficult for industry to assess the benefits of using novel hardware. We propose establishing benchmarks to compare current and novel architectures using Big Data applications.
  • Identify and build accelerated building blocks. We propose to identify often-required functional building blocks in existing processing frameworks and to replace these blocks with (partially) hardware-accelerated implementations.
  • Investigate intelligent use of heterogeneous resources. With edge computing and cloud computing environments calling for heterogeneous hardware platforms, we propose the creation of dynamic scheduling and resource allocation strategies.
  • Continue to ask the question – Do companies think that hardware and networking optimizations for Big Data can solve the majority of their problems? As more and more companies learn how to extract value from Big Data as well as determine which business models lead to profits, the number of service offerings and products based on Big Data analytics will grow sharply. This growth will likely lead to an increase in consumer expectations with respect to these Big Data-driven products and services, and we expect companies to run into more and more undesirable performance bottlenecks that will require optimized hardware.
Clearly Europe – just as the U.S. – is mobilizing efforts to turn high performance technology into a competitive strength generally and also turning its attention to big data. Increasingly, that means a blending of HPC capabilities with big data analytics to turn the growing gush of data into scientific insight and commercial advantage. The latest report, roughly 50 pages, is a quick but worthwhile read for obtaining EU directional thinking.

Link to BSC press release: https://www.bsc.es/about-bsc/press/bsc-in-the-media/bsc-highlights-need-european-research-big-data-hardware-and

Link to the THINK big roadmap: http://www.rethinkbigproject.eu/sites/default/files/u273/D5.3RoadmapV23_0.pdf

How P&G Got Hooked on Analytics: ‘High Value Problems People Were Willing to Pay For’


Procter & Gamble, the $76 billion consumer packaged goods giant with sales to 5 billion consumers in 180 countries, has lots of data. Four years ago, as the company found itself waist-deep in a rising tide of opaque data, P&G embarked on a voyage of big data discovery, one that provides a series of technology and organizational object lessons for the host of quietly desperate companies (75 percent, according to Forrester Research) that worry they're not wrestling enough insight from their data.

Walking point into the analytics jungle for P&G is Terry McFadden, associate director and enterprise architect (Decision Support Systems), who has been at the company for 30 years. Known around P&G as a “part-time comedian,” McFadden has an avuncular, competent and upbeat bearing that, one can imagine, puts colleagues at ease and promotes camaraderie in a team embarking on a daunting task.

The initial impulse toward data analytics at P&G began around 2012 when new sources of data converged into today’s data onslaught.

“Our business teams needed more access, more data,” McFadden said recently in a presentation at Strata + Hadoop World in New York. “When you think about it, it’s sales, it’s market measurements, it’s weather, it’s social. They wanted more granularity. We needed to wrap our arms around more of it and really improve the time-to-decision. We wanted to get deeper into the explanatory analytics, answering the age-old question: ‘Why?’… We needed the ability to acquire and integrate many different types of data – structured, unstructured – we had that problem relative to the volume, the variety, the velocity, the variability of data.”

Traditional approaches for gaining insight from data just weren’t adequate, McFadden said. “The volumes were growing, the costs were growing, and the time-to-insight required in terms of the business event cycle was speeding up, and so this is a real challenge.”

As can happen when new technology takes the stage, the impetus to embark on a new strategy came from a senior P&G exec who sent an email to the IT group extolling the potential offered by the exciting new field of big data analytics. “I remember seeing an email from upper management and I about came out of my chair,” McFadden said. “It was one of those moments where – I’m teasing – somebody must have read something in Sky Magazine about Big Data, about what they ought to do and what products they ought to apply.”

Soon enough, a VP asked McFadden to come to his office. “We need to solve big data,” McFadden said to him. “Fine,” said the VP, and began issuing battle orders: “You will do the architecture, you will do the evaluation of technologies, you will recommend the technology, you will drive the PoC, you will make a final recommendation, you will work with our folks and identify the key problems we’re going to go after.”

And he gave McFadden six months.


Terry McFadden of P&G

Where to begin? McFadden took a page out of a Gartner Group playbook that the consulting group recently published: “Mastering Data and Analytics: 'Table Stakes' for Digital Business.” In the initial phase, the idea is to go big and small simultaneously: take on strategic challenges that senior people are struggling with, in order to demonstrate value, but proceed on a baby-step basis:

“Start small,” says Gartner, “and build momentum by alleviating operational pain and deriving business value.” Small steps are more likely to proceed, and each success generates more organizational buy-in.

It was decided to go after category management as the broad and ambitious purview for data analytics at P&G. And the targeted beneficiaries were “embedded business analysts,” described by McFadden as advisers whom business executives rely on to “provide insight to daily problems, an ever increasing set of tough questions that are continually changing.” If ever there were a group at P&G that would value big data analytics, this was it.

McFadden said the embedded analysts were sent surveys asking them: “Imagine you had a magic 8-ball and you could ask it anything, what are the wicked problems you wish it would answer?” The answers helped guide the initial proof points for McFadden’s team.

“We were gated by our ability to sell services – and really, it’s sell solutions and solve problems – what are the high value problems that people were willing to pay for,” McFadden said, adding that a use case would be live “sense and respond” interactions for customer-facing managers involved in a new product launch. Another example: supporting better interfacing between manufacturing and retailers “to really create a win/win scenario.” “It’s a challenging issue, it’s absolutely a big data problem, that in our proof of concept there were about two dozen data sets to bring together that we had never brought together before.”

This required the ability to quickly load and integrate structured, unstructured and semi-structured data. Within two weeks, McFadden’s team was able to load more than 25 data sources, including market signals, item sales, market share, surveys, social media and demographics, along with traditional sources, all within P&G’s data warehouse.
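
As a schematic of that kind of rapid multi-source staging (illustration only; P&G’s platform and source files are not described at this level of detail, so the file names and formats here are invented), each source lands in its own staging table regardless of format:

```python
import json
import pandas as pd

# Hypothetical source registry: name -> (format, path).
sources = {
    "item_sales":  ("csv",  "item_sales.csv"),
    "social_feed": ("json", "social_feed.json"),
    "survey":      ("csv",  "survey_responses.csv"),
}

def load_csv(path):
    return pd.read_csv(path)

def load_json(path):
    with open(path) as f:
        # Flatten nested JSON records into a tabular staging frame.
        return pd.json_normalize(json.load(f))

loaders = {"csv": load_csv, "json": load_json}

staged = {}
for name, (fmt, path) in sources.items():
    staged[name] = loaders[fmt](path)   # each source gets its own staging table
    print(name, staged[name].shape)
```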

In building the infrastructure, McFadden said it was decided early on that a top priority, architecturally, was to build an analytics ecosystem that did not take the data off-platform. Back then, in 2013, the tools vendors assured customers that they did Big Data and that they had Hadoop, but an outboard platform was involved, “and the size of the outboard motor and attendant costs of that model was not attractive to us,” McFadden said.

After an extensive search, McFadden’s team found this capability at Arcadia Data, a visual analytics and BI platform for big data that enables users to create visualizations of Hadoop data.

“Yes, we have other tools associated with the stack,” McFadden said, “and we continue to look at and evaluate the marketplace, but we believe fundamentally in the model of moving the work to the data, and not moving the data off platform to the work. We think that’s a core, winning tenet of the Hadoop ecosystem. And when I say that, it’s everything that’s grown around…that ecosystem that morphs with different processing frameworks that is very attractive. Those that move the work to the data continue to show an ROI that allows you to build out more and more.”

McFadden characterized P&G’s analytics physical landscape as largely a Cloudera stack, “a typical picture, the creatures in the zoo,” an infrastructure based on an appliance approach that enables high speed connectivity to its high speed data warehouse ecosystem.

He said results came quickly.

“Literally, inside of two weeks we were up and running,” he said. “We had a SWAT team deployed to help us. I didn’t expect anybody to say: ‘I’m just going to get a big data container, put data in it, shake it like the Magic 8-ball and get answers.’ But inside of two weeks, with a lot of help from our partners, we were able to start banging on the data and get after some of these (high value) questions.”

McFadden and his team went from there, building analytics capabilities use case by use case.

“It wasn’t a big bang,” he said. “No one told us, ‘Here’s the budget money, now move it all over.’ This was: We’re going to focus on high value problems that people are willing to pay for to be able to solve. There’s a dragon in their business and they want it slain, and they haven’t been able to do that with traditional approaches.”

The result has been making believers out of analysts and managers who might otherwise regard data analytics as a fad, “or who would chant ‘obsolete before plateau.’”

“It’s sell a little, make a little, we hope learn a lot, grow and keep on investing,” McFadden said. “That’s been our approach.” And from a team management point of view, it’s “a lifetime of attention and study that’s going to be required here.”

SIEM Gains as Consumer Security Software Fades


Security information and event management (SIEM) software fueled a robust global security software market in 2015 even as sales of consumer security software declined sharply last year, according to the latest accounting by market analyst Gartner Inc.

Gartner (NYSE: IT) reported this week that worldwide security software revenue jumped 3.7 percent over the previous year to $22.1 billion. Sales of SIEM software used to support threat detection and response to security breaches rose by a whopping 15.8 percent year-on-year as it gained market traction via its real-time collection and analytics capabilities. The analytics software is used to sift through a wide variety of event and contextual data sources to provide an historical analysis of security breaches.

Meanwhile, Gartner reported that global sales of consumer security software tanked in 2015, dropping 5.9 percent on an annual basis. Market leader Symantec (NASDAQ: SYMC) took the biggest hit, with annual revenues declining by an estimated 6.2 percent from the previous year.

Overall, leading consumer vendors registered a collective decline in revenues estimated at 4.2 percent in 2015. The declines for Symantec and second-ranked vendor Intel Corp. (NASDAQ: INTC) were attributed to a drop in consumer security and endpoint protection platform software. The latter combines device security functionality into a single capability that delivers antivirus, anti-spyware, firewall and host intrusion prevention.

Of the top five vendors ranked by Gartner, only IBM registered revenue growth last year, on the strength of its SIEM sales along with its service business, which the market watcher noted also generates demand for its product segment. IBM (NYSE: IBM), which integrated its SIEM platform with market leader Resilient Systems last year, acquired the "incident response" specialist earlier this year.

"The below-market growth seen by these large vendors with complex product portfolios is in contrast to the market growth and disruption being introduced by smaller, more specialized security software vendors," Gartner research analyst Sid Deshpande noted in a statement releasing the revenue totals.

The sharp decline in consumer security software also reflects the growing sophistication of security breaches such as ransomware and the desire by more enterprises to detect and blunt attacks as they unfold. Businesses also are realizing that upfront investments in analytics-based approaches like SIEM may yield future savings as the cost of dealing with a single security breach can easily reach into the millions of dollars.

Hence, the core capabilities of SIEM technology are increasingly seen as a more comprehensive way of collecting data points on security "events" along with the ability to correlate and analyze those events across a range of data sources, Gartner noted.
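
To give a flavor of the event correlation SIEM tools perform (a deliberately simplified illustration, not any vendor's implementation; the log records are invented), the sketch below merges events from two sources and flags an account whose repeated failures are followed by a success:

```python
from collections import defaultdict

# Events from two different log sources, normalized to a common shape.
vpn_logs = [
    {"ts": "2016-10-05T09:00:01", "user": "jdoe", "event": "login_failed"},
    {"ts": "2016-10-05T09:00:09", "user": "jdoe", "event": "login_failed"},
]
app_logs = [
    {"ts": "2016-10-05T09:00:15", "user": "jdoe", "event": "login_failed"},
    {"ts": "2016-10-05T09:00:40", "user": "jdoe", "event": "login_success"},
]

# Correlate across sources in time order: 3+ failures followed by a success is suspicious.
events = sorted(vpn_logs + app_logs, key=lambda e: e["ts"])
failures = defaultdict(int)
for e in events:
    if e["event"] == "login_failed":
        failures[e["user"]] += 1
    elif e["event"] == "login_success" and failures[e["user"]] >= 3:
        print(f"ALERT: possible credential stuffing for {e['user']} at {e['ts']}")
```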

So-called "operational intelligence" vendor such as Splunk Inc. (NASDAQ: SPLK) have recently released new versions of security and user behavior analytics packages. The new capabilities are said to combine the best features of machine learning and anomaly detection to sift through and prioritized data breaches and other threats.

Meanwhile, other emerging SIEM platforms are designed to automate security processes and policies used to respond to everything from insider attacks to lost mobile devices.

Enterprises Embrace Machine Learning


Machine learning technology is poised to move from niche data analytics applications to mainstream enterprise big data campaigns over the next two years, a recent vendor survey suggests.

SoftServe, a software and application development specialist based in Austin, Texas, reports that 62 percent of the medium and large organizations it polled in April said they expect to roll out machine learning tools for business analytics by 2018. That majority said real-time data analysis was the most promising big data opportunity.

The survey authors argue that artificial intelligence-based technologies like machine learning are moving beyond the "hype cycle" as enterprises look to automate analytics capabilities ranging from business intelligence to security. (In the latter case, the Defense Advanced Research Projects Agency is sponsoring an "all-machine hacking tournament" in conjunction with next month's DEF CON hacking convention in Las Vegas. The goal is to demonstrate that cyber defenses can be automated as more infrastructure is networked via the Internet of Things.)

The survey found that the financial services sector is among the early adopters of big data analytics and emerging approaches such as machine learning. About two-thirds of financial services companies said analytics was a "necessity" to stay competitive while 68 percent said they expect to implement machine-learning tools within the next two years.

Among the incentives for early adoption is growing pressure on financial institutions "to close the gap between the experiences they provide and what consumers have come to expect," the survey authors noted. Big data is increasingly seen as a way to meet client demand for faster and more accurate service, they added.

For the IT sector, big data is widely viewed as a way to reduce operating costs through savings on software licensing and commodity hardware.

Meanwhile, tools like machine learning also are perceived as helping to break down data siloes while improving the quality of business intelligence data used in decision-making. The survey cited estimates that poor quality data can cost businesses as much as $14 million a year. "A big data transformation is able to overcome this challenge by systematically integrating these silos – and turning bad data into good information," the survey asserts.

"Businesses that take the plunge and implement machine learning techniques realize the benefits early on – it’s big a step forward because it delivers prescriptive insights enabling businesses to not only understand what customers are doing, but why," Serge Haziyev, SoftServe's vice president of technology services, noted in a statement.

The survey of 300 executives in the U.K. and U.S. also found that the retail sector is most concerned about data governance issues.


HPE Gobbles SGI for Larger Slice of HPC-Big Data Pie


Hewlett Packard Enterprise (HPE) announced today that it will acquire rival HPC server maker SGI for $7.75 per share, or about $275 million, inclusive of cash and debt. The deal ends the seven-year reprieve that kept the SGI banner flying after Rackable Systems purchased the bankrupt Silicon Graphics Inc. for $25 million in 2009 and assumed the SGI brand.

Bringing SGI into its fold bolsters HPE’s high-performance computing and data analytics capabilities and expands its position across the growing commercial HPC market and into high-end supercomputing as well. Per analyst firm IDC’s latest figures, the HPC market is at $11 billion and set to grow at an estimated 6-8 percent CAGR over the next three years. The data analytics segment, which is very much in play here, is said to be growing at over twice that rate. “Big data combined with HPC is creating new solutions, adding many new users/buyers to the HPC space,” stated IDC in its June HPC market update.

A joint announcement from HPE and SGI focused on how this explosion in data is driving increased adoption of high-performance computing and advanced analytics technologies in government and commercial sectors. HPC systems are critical for advancing such fields as weather forecasting, life sciences, and increasingly for cybersecurity and fraud detection, said HPE.

“Once the domain of elite academic institutions and government research facilities, high-performance computing (HPC) – the use of ‘super’ computers and parallel processing techniques for solving complex computational problems – is rapidly making its way into the enterprise, disrupting industries and accelerating innovation everywhere. That’s because businesses today are recognizing the big potential in the seas of their corporate data,” Antonio Neri, executive vice president and general manager, HP Enterprise Group, shared in a blog post.

He continued: “Organizations large and small are adopting HPC and big data analytics to derive deeper, more contextual insights about their business, customers and prospects, and compete in the age of big data. These businesses see revenue opportunity in the explosion of data being generated from new sources, like the proliferation of mobile devices, the Internet of Things, the ever-expanding volumes of machine-generated data, and the increase of human data in the form of social media and video.”

SGI CEO Jorge Titinger also emphasized the benefits of the union for data-driven organizations. “Our HPC and high performance data technologies and analytic capabilities, based on a 30+ year legacy of innovation, complement HPE’s industry-leading enterprise solutions. This combination addresses today’s complex business problems that require applying data analytics and tools to securely process vast amounts of data,” he said. “The computing power that our solutions deliver can interpret this data to give customers quicker and more actionable insights. Together, HPE and SGI will offer one of the most comprehensive suites of solutions in the industry, which can be brought to market more effectively through HPE’s global reach.”

SGI makes server, storage and software products, but it’s the UV in-memory computing line that has lately been the coveted star of the company’s portfolio. In February, SGI signed an OEM agreement with HPE for its UV 300H technology, a version of the SGI UV 300 supercomputer that is purpose-built for SAP HANA. As we noted previously, the 8-socket server “filled the gap between its HPE ProLiant DL580 Gen9 Server, with 4-socket scalability at the low end, and the HPE Integrity Superdome X server that scales up to 16 sockets and 24 TB of memory at the high end.”

Notably Dell and Cisco are both resellers for the entire SGI UV 300H line, which scales as a single node system from 4-32 sockets in four socket increments. Just how the SGI sale will affect these arrangements remains to be seen, but it’s hard to imagine Dell as a reseller for HPE.

In the high-end supercomputing segment (systems above $500k per IDC), HPE was the top earner among HPC server vendors in 2015: taking in $1.23 billion in revenue out of a total $3.28 billion. Cray came in second ($583 million) and then Lenovo ($391 million). SGI’s share was $88 million.

IDC 2015 Revenue Share by Vendor - supercomputing

SGI, now located in Milpitas, Calif., after selling its storied Silicon Valley headquarters to Google in 2006, brought in $533 million total revenue in FY16 and $521 million in FY15. Its GAAP net loss for 2016 was $11 million, or $(0.31) per share compared with a net loss of $39 million, or $(1.13) per share in 2015. The company has approximately 1,100 employees.

The deal's $7.75 per share price represents a 30 percent premium over today's closing price of $5.98. In after hours trading, shares of SGI have gone up by nearly 30 percent to $7.70. HPE's stock closed today at $21.78, falling just .05 percent in after hours trading to $21.77.

The transaction is on track to close in the first quarter of HPE’s fiscal year 2017, after which SGI will become part of the HPE Data Center Infrastructure group, led by Alain Andreoli. HPE expects the effect on earnings to be neutral in the first full year after close of sale and accretive thereafter.

The SGI purchase is the latest in a series of big changes for the HP brand. Last September, Hewlett-Packard officially split the PC and printer business from its enterprise (and HPC) activities, creating Hewlett-Packard Enterprise (HPE) to focus on servers, storage, networking, security and corporate services. In May, HPE went through another split when it merged its enterprise services unit with CSC to create a $26 billion “pure-play” IT services organization.

Data Lakes and Overcoming the Waste of ‘Data Janitor’ Duties

$
0
0

Data lakes solve a lot of problems in today's big data world. When properly designed, they serve as an efficient means for storing large volumes and varieties of data. A data lake's efficiency comes from its inverse approach to data storage as compared to traditional structured data warehouses: Rather than enforcing rigid schemas upon arrival, data lakes allow data to exist, unconstrained, in native format.

This approach, commonly referred to as 'schema-on-read,' defers the time-consuming task of data modeling until a time when the enterprise has a clear idea of what questions it would like to ask of the data. In light of the influx of new data sources that enterprises must contend with — many of them unstructured and many whose value is not yet fully understood — this approach not only promotes agile response to new data sources, it also works to ensure that future analysis efforts aren't constrained by schema constraints that came before.

So that's the good news: data lakes hold the promise of a fundamentally better way of tackling the demands of big data. And, we have the technology and processing power to do it in the cloud, allowing for the massive scale of both storage and computing that today’s big data needs require.

Sounds great, but…

Building a data lake and pouring data into it is the easy part. The hard part is managing that data so that it’s useful. Without schemas or relational databases to provide context and consistency, the real challenge enterprises face is finding other ways to link and correlate the widely disparate data types stored in their data lakes — or risk them becoming a collection of siloed “data puddles.” In fact, it's been estimated that data scientists spend up to 80 percent of their time as “data janitors,” cleaning up raw data in preparation for analysis. 1 Those highly trained experts should be focusing on analysis and insights, not merging and de-duplicating records.

Not to mention, it’s almost impossible for an enterprise to reliably predict its future analytics needs. That means a data lake designed for today’s storage and analytics needs may be completely ill equipped to handle tomorrow’s. In order to integrate new data types and solutions, the enterprise would need to constantly reconfigure the lake, consuming precious time and resources with each new integration. Enterprises need a solution now — and they need it fast.

Why the urgency?

Because readying data for analysis is the most time-consuming part of any big data initiative—and certainly the most challenging part of analyzing data stored in data lakes — the whole process requires an entirely new approach. Data has become a key competitive differentiator for companies. For some it’s even more valuable than their core product. Take Facebook, whose user data is orders of magnitude more valuable than its newsfeed. Uber and Air BnB have recognized the tremendous value of their data about users’ travel habits, which could far exceed the value of their actual services.

Why is this so critical right now? Four reasons:

  • We’re facing a continued dramatic escalation in the volume and variety of data inflow, which legacy systems are unprepared to handle.
  • Enterprises can’t predict their future data needs, but do know they’ll need to be able to react even faster than they do now. Current systems already can’t keep up — they need far greater agility.
  • Conventional data lakes that depend on relational databases are simply too clunky. As new business questions arise or new systems are brought to bear (layering on a graph database or a search engine, or investigating a complex business question, for example), enterprises need a solution that can create just-in-time data pools, grouping specialized data sets within the larger lake without a full extraction, something legacy systems cannot do.
  • The lines between data integration and data management are blurring. The two should operate as a symbiotic process, something conventional data lake environments are not equipped for. It calls for a solution that marries them so they work in harmony.


dPaaS: A Future-Ready Approach to Big Data

dPaaS, or Data Platform as a Service, has emerged as a much more agile solution that unifies the closely-related operations of integration, data management and data storage (i.e. data lake hosting). This new, data-centric approach is critical for creating a foundation that can accommodate all types of data formats and analytics applications, now and in the future.

Unlike solutions that rely on hard-coded, point-to-point connections between the disparate databases that make up a data lake, dPaaS persists every iteration of the data in the central repository (i.e., the data lake) as it is ingested and integrated, along with robust metadata, so the lake grows organically. Data management functions such as cleansing, deduplication and match-and-merge also take place at this stage. With metadata attached, the data can be organized on the way in, moving from its raw format in the lake into specialized data stores. Investigators can then comb through and transform the data for the queries they need right now without modifying the core data, leaving it free to be transformed again, in a different way, to answer the next question.
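
To make the “persist everything, with metadata” idea concrete, here is a minimal Python sketch of an ingestion step. It is illustrative only; the lake path, field names and helper are not taken from any particular dPaaS product:

import json, uuid, hashlib
from datetime import datetime, timezone
from pathlib import Path

LAKE_ROOT = Path("/data/lake/raw")   # hypothetical lake location

def ingest(payload: bytes, source: str, fmt: str) -> Path:
    """Persist a raw payload untouched, alongside provenance metadata."""
    record_id = uuid.uuid4().hex
    meta = {
        "record_id": record_id,
        "source": source,                      # e.g. "crm-export", "clickstream"
        "format": fmt,                         # e.g. "json", "csv", "avro"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity / dedup hint
    }
    target = LAKE_ROOT / source / record_id
    target.mkdir(parents=True, exist_ok=True)
    (target / "data").write_bytes(payload)               # raw bytes, never modified
    (target / "meta.json").write_text(json.dumps(meta))  # metadata travels with the data
    return target

Downstream cleansing, deduplication and match-and-merge jobs can then key off the metadata (source, hash, timestamp) without ever rewriting the raw copy.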

Delivering the dPaaS solution with a microservices approach enables even greater analytical flexibility and power, improving both the usefulness of existing data and the stability of the entire system. Unlike monolithic packaged software that offers 100 features when you really only need five, microservices let enterprises customize their data lake processes, adding only the functionality they need, when they need it, and easily retiring functions they no longer require. Even better, this can be done on demand, with no complex integration required. Instead of modifying their business processes to fit a vendor’s application, enterprises can keep the process and use the microservices they need to get the job done.

For even greater efficiency, these integration and data management functions can be offered as managed services, freeing data scientists and other experts to focus on the end goal: analysis. It also puts the onus on the vendor offering the managed service to handle security, compliance and maintenance, avoiding the risks of iPaaS-style self-service integration, where governance is harder to control and security gaps can creep in.

dPaaS Lets Enterprises Get to Work

With a cleansed and enriched lake of data at the ready, and time available for data scientists to devise the queries, analysis can be performed ad hoc using microservices and a configuration-based system. Schemas and their outputs are modeled on the fly, moving data rapidly from the lake into whatever format suits the task at hand: a graph database one week, a time series database the next, or a relational or key/value store the week after that.
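
As a minimal sketch of that schema-on-the-fly pattern, assuming a PySpark environment (the paths and column names here are hypothetical), the same raw events can be projected into whatever shape this week’s question needs:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw, untyped events as they landed in the lake (schema inferred at read time).
events = spark.read.json("s3a://example-lake/raw/events/")   # hypothetical path

# This week's question: daily activity per user, shaped for a time series store.
daily = (events
         .select(F.col("user_id"), F.to_date("timestamp").alias("day"))
         .groupBy("user_id", "day")
         .count())
daily.write.mode("overwrite").parquet("s3a://example-lake/marts/daily_activity/")

# Next week's question can project a completely different shape (say, an edge
# list for a graph database) from the very same raw events, with no change to the lake.
edges = events.select(F.col("user_id").alias("src"), F.col("referrer_id").alias("dst"))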

Exciting new technological advances such as non-relational databases, distributed computing frameworks, and, yes, data lakes are all getting us closer to a data-inspired future. But, there's still no substitute for good old-fashioned data management. In fact, the need for diligent data governance is more important than ever, and the dPaaS approach to data operations makes sure this critical piece of the puzzle is not overlooked.

With dPaaS, the real promise of Big Data is within reach, giving enterprises the ability to actually use their data for maximum impact and competitive advantage.

1. The New York Times, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," August 2014

Brad Anderson is vice president of big data informatics at Liaison Technologies.

Bullish Hadoop Forecast Despite Spark Hype


Despite major market inroads being made by Apache Spark, a new forecast estimates the global market for the Hadoop big data framework will continue to grow at a healthy clip through 2021, fueled in part by growing enterprise demand for Hadoop services.

According to a market forecast released this week by Allied Market Research, the Hadoop market is expected to grow at a 63.4 percent compound annual growth rate over the next five years, reaching $84.6 billion by 2021. The sustained growth is attributed in part to accelerated Hadoop adoption in Europe, where annual growth rates are expected to top 65 percent.
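
As a rough sanity check on what that projection implies, and assuming five compounding years from a 2016 baseline (the forecast summary does not spell out the compounding period), the standard CAGR arithmetic points to a market of roughly $7 billion to $8 billion today:

\[ V_{\text{start}} \;=\; \frac{V_{\text{end}}}{(1+\mathrm{CAGR})^{n}} \;=\; \frac{\$84.6\ \text{billion}}{(1.634)^{5}} \;\approx\; \frac{\$84.6\ \text{billion}}{11.65} \;\approx\; \$7.3\ \text{billion} \]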

The sustained growth of the Hadoop market stems largely from higher rates of adoption in North America, especially in the IT, banking and government sectors as enterprise big data strategies have been rolled out.

The forecast includes Hadoop software, hardware and services. The market watcher found that Hadoop services accounted for nearly half (49 percent) of global demand. Hadoop services include consulting, "integration and deployment" along with middleware and support. Consulting and related services alone accounted for an estimated $1.6 billion in revenue last year, the market researcher said.

Meanwhile, increasing investments in big data analytics and "real-time operations" are expected to drive integration and deployment services, making it the fastest growing Hadoop services segment (64.8 percent) over the next five years. The "trade and transportation" sectors are expected to fuel adoption of Hadoop services during the forecast period, with an annual growth rate pegged at 76.3 percent through 2021.

Along with trade and transportation, other key end-users of Hadoop products and services include a banking, financial services and insurance category along with manufacturing, retail, telecommunications, healthcare and media and entertainment. "Factors such as aggrandized generation of structured and unstructured data and efficient and affordable data processing services offered by Hadoop technology are the major drivers of the market," the analyst concluded.

Along with the rise of Spark, other potential challenges to the continued growth of Hadoop adoption include distributed computing and security issues, researchers added.

Among the Hadoop market leaders identified in the forecast are Hortonworks Inc., Cloudera Inc. and MarkLogic, which "have focused on development of advanced, Hadoop-based data storage, management, and analytics solutions to cater to the customized requirements of business enterprises."

The Hadoop market also may face headwinds as the much-hyped Spark platform matures. Billed as the next-generation data processing engine, Spark 2.0 is expected to offer a new structured streaming approach that could help unify development of big data batch and streaming applications.

Another knock on Hadoop, which was launched as a batch processing system for search data, is that it was not designed for real-time, interactive analytics and reporting. Development work on new processing engines to supplement or replace Hadoop's core MapReduce engine has met with mixed results, critics contend.

IBM Project DataWorks: Joining Multi-Sourced Data for AI-based Analytics


IBM’s aggressive push into the data analytics market continued today with the announcement of Project DataWorks, a Watson initiative that IBM said is the first cloud-based data and analytics platform to integrate all types of data and enable AI-powered decision-making.

Project DataWorks is designed to lower the complexity for business managers and data professionals to collect, organize, govern, secure and generate insight from multi-sourced, multi-format data. The goal: become what IBM calls “a cognitive business.” Project DataWorks deploys data products on the IBM Cloud using machine learning and Apache Spark, ingesting data at rates from 50 gigabits per second to hundreds of gigabits per second from a variety of endpoints: enterprise databases, Internet of Things devices, streaming sources, weather feeds and social media.
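
IBM has not published the plumbing behind those numbers, but in generic Apache Spark terms the multi-endpoint ingestion pattern it describes might look like the sketch below; the Kafka broker, topic names and storage paths are hypothetical stand-ins, not part of Project DataWorks:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-source-ingest").getOrCreate()

# Continuous feeds (e.g. IoT or social streams) arriving via Kafka.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "iot-events,social-mentions")  # hypothetical topics
          .load())

# Land the raw stream in governed storage so analysts and models share one copy.
query = (stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "s3a://example-platform/landing/streams/")
         .option("checkpointLocation", "s3a://example-platform/checkpoints/streams/")
         .start())

# Batch sources (enterprise databases, weather files, etc.) can be appended to the
# same governed store with ordinary spark.read / write jobs.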

“It’s a system that will on-board data, tools, users, apps, all in a scalable and governed way,” Rob Thomas, VP of Products, IBM Analytics, told EnterpriseTech. “The purpose is simple: we are preparing all data within a company for use by AI. We’re helping people leap in to the future around AI and machine learning.”

Project DataWorks is intended to overcome much of the complexity involved in implementing big data analytics. Most of the work in large-scale analytics projects is done by data professionals working in silos with disconnected tools and data services that can be difficult to manage, integrate and govern. IBM said Project DataWorks helps break down those barriers by connecting multi-format data. Data professionals can work together on an integrated, self-service platform, sharing common datasets and models for better governance while iterating on data projects and products, with less time spent finding and preparing data for analysis.

IBM's Rob Thomas

Available on Bluemix, IBM’s cloud platform, Project DataWorks is “built entirely on Open Source,” Thomas said, “so clients can have access to all the innovation that’s in the Open Source community, but not deal with the headaches of trying to integrate those pieces.” IBM said Project DataWorks leverages an open ecosystem of more than 20 partners and technologies, such as Confluent, Continuum Analytics, Galvanize, Alation, NumFOCUS, RStudio and Skymind.

IBM also announced a list of customers using Project DataWorks, including Dimagi, KollaCode LLC, nViso, Quetzal, RSG Media, Runkeeper, SeniorAdvisor.com and TabTor Math.

RSG Media, which delivers analytical software and services to media and entertainment companies, uses Project DataWorks to perform analytics across large volumes of first- and third-party data sets. These include monitoring cross-platform content and advertising viewership and identifying individual viewing behaviors while cross-analyzing demographic, lifestyle and social insights. RSG Media helps clients gain insights on audience preferences and develop programming schedules; according to the company, in one scenario this resulted in a lift of $50 million to a single network’s bottom line.

“We realized that we needed more than just a cloud infrastructure provider. We needed a partner to help us manage data on an unprecedented scale, and empower our clients to turn that data into insight,” said Mukesh Sehgal, founder and chief executive officer, RSG Media. “IBM is the only cloud vendor who offers an integrated set of capabilities for building advanced analytics applications that would allow us to quickly and cost-effectively bring new offerings to market."

IBM also announced the DataFirst Method, a methodology that helps organizations assess their skills and build a roadmap for progressing in their use of data, including practices and methods to help clients transform their processes for data discovery, handling and analytics.

The ‘Insight-Driven Business’: How to Become a Master of the Data Universe


Masters of the Data Universe: Uber, Netflix, Facebook, Amazon, Google. We know who they are. They hire genius data scientists who wrestle data into submission, building elite analytics superstructures that light up their data and reveal insights about their markets while nurturing interactive customer relationships at scale that leave everyone else shaking their heads, wondering how they do it.

It can seem like magic. As Forrester Research’s Brian Hopkins, vice president and principal analyst, observes, there’s the vague notion (encouraged by some vendors) “that…somehow data goes into a box with the elephant and good things come out of the box, magic happens.”

But now, several years into the Big Data analytics revolution, patterns and common characteristics are emerging among the Masters of the Data Universe. According to Hopkins, they have landed on the critical issue: they view analytics through a strategic prism that distinguishes data from action.

Put another way: there’s a wide gulf between Big Data and actual analytics - much more of the former is going on than the latter. Most companies have built the initial Big Data framework that could be used in an analytics implementation. “We aren’t lacking for investment in new big data technologies,” Hopkins said at last week’s Strata + Hadoop conference in New York. “In fact, this year at Strata I assume every company has built a data lake. You probably have Hadoop.”

So while the enterprise landscape is a data lakes district, the lakes themselves are dead pools for generating the kind of market and customer insights that companies crave. Forrester ran a survey last year in which nearly three-quarters of data architects said they aspire to be data driven. “In the same survey, I asked how good are you at actually taking the results of analytics, the insights, and creating actions that matter to your firm? Far fewer people say they can do that well,” Hopkins said. The old saw that we’re drowning in data and thirsting for insight still holds true.

Hopkins outlined a strategic framework for moving toward transformative analytics, one based on his study of the Masters of the Data Universe and their shared traits for becoming an “Insight-Driven Business.” Attaining this status, he said, is both achievable and mandatory: failure to figure out Big Data analytics will mean significant competitive disadvantage and eventual decline.

The first step is to embrace a concept Hopkins calls “digital insight,” the ability “to systematically harness and apply digital insights to create a sustainable competitive advantage.”

“By that I mean new actionable knowledge. It’s data, but it’s data that leads to action in the context of a process or a decision,” he said. The transformative aspect of “digital insight” is that new knowledge becomes embedded in software, getting it into the guts of the analytics system and serving as an insight engine that knows no rest.

This “refocuses our conversation from ‘How do I get insight (that’s) in the data?’ to ‘How do I implement insight – no matter where it comes from – in software?’” Hopkins said. “So it’s a marriage of the insight execution in application development.”

The second step is to understand how Insight-Driven businesses operate. Hopkins said he has interviewed hundreds of “digital disrupters,” ranging from the monsters (Netflix, et al.) to start-ups, “trying to figure out what it is they’re doing differently.”

“It’s not the technology or what they did with data that’s so interesting,” he said, “it’s the fact that they understand how to apply insights in software to drive competitive advantage in ways that many of us don’t.”

Common among them is the circular and continual use/re-use of data within a closed loop system. “The pattern appears over and over again. They’re operating in this closed loop and they’re operating faster in a closed loop than their competition.” This “system of insight,” he said, is going to the right data, combining the right data and creating effective actions using a process Hopkins calls Insight to Execution.

The idea of “circular insights” is not new. What’s different about Insight to Execution, according to Hopkins, is “continuous experimentation, testing and learning with your insights. So (it is) having an insight, deploying a predictive model, updating that predictive model.” Updates aren’t happening every six months, either; they’re happening continually. Insight-Driven businesses “are always hypothesizing.”

Hopkins summarizes the Insights to Execution loop as:


  1. Experiment and learn continuously – question every process and decision.
  2. Identify outcomes and interim metrics – develop metrics for every outcome; instrument and measure processes, decisions and outcomes.
  3. Gather more data – Start with data you have, then add new sources and kinds of data as you learn.
  4. Develop insights – apply analytics and AI methods to develop potential insights.
  5. Test and implement insights in software – run insight experiments in software, processes and decisions.
  6. Measure results and refine insights – courageously assess and share the results. Did what you expected to happen actually happen? (A toy code sketch of the full loop follows this list.)
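
Stripped of the tooling, the loop is simple enough to caricature in a few lines of Python. The sketch below is purely illustrative; the step functions are placeholders supplied by the caller, not anything Forrester or Hopkins prescribes:

import time

def insight_to_execution_loop(gather, model_fn, deploy, measure, refine,
                              hypothesis, cycle_seconds=3600, cycles=None):
    """Caricature of a continuous test-and-learn loop; callers supply each step."""
    n = 0
    while cycles is None or n < cycles:
        data = gather(hypothesis)                 # 3. start with data you have, add more later
        insight = model_fn(data)                  # 4. apply analytics / ML to develop insights
        deploy(insight)                           # 5. embed the insight in software
        results = measure(insight)                # 6. measure and share the outcomes
        hypothesis = refine(hypothesis, results)  # 1-2. new experiments, outcomes and metrics
        n += 1
        time.sleep(cycle_seconds)                 # cycle in hours or days, not months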

Hopkins cited Stitch Fix, an online custom-clothing company. One way Stitch Fix is an Insight-Driven Business is that it continually enriches its data by running experiments to gain greater customer understanding – to the point of sending clothing to people who they believe are likely to return items “just to learn stuff about the people who will keep the clothing, so they can make their models better. They’re running experiments. They’re experimenting and learning continuously as they go around the loop.”

A key element to this experimentation, he said, is “understanding the outcomes you want to change and getting granular about the level of detail so you can measure an instance.”

Brian Hopkins of Forrester Research

Hopkins said Stitch Fix followed the wisdom of starting a 1000-mile journey with a single step: they began with the data that they had, and worked their way up from there. This runs counter to many companies that begin by amassing enormous amounts of (unwieldy) data. Stitch Fix doesn’t “have a whole lot of really exotic data, they’re merely optimizing the data in that ‘System of Insight’ in which they transact with people.

“What a lot of these companies do is start that way, and as you go around the loop, then you add those secondary data sets,” Hopkins said. “But start with the data that you have, add more data over time as you find and drive those insights.”

Hopkins said testing and implementing insights in software “makes application developers as important as data scientists in this process.” At Stitch Fix, they’re the same person, he said, while at others, like Tesla, they’re different people who work together as a team. Hopkins said he’s talking to Insight-Driven CEOs who tell him they put as much emphasis on hiring good software developers, who can embed insight into code, as they do on good data scientists.

Hopkins said Stitch Fix’s chief algorithm officer (CAO) told him the company runs algorithms on Amazon S3, deploying them into the business applications used by employees to send clothing to customers, “feeding all that data back to the system in real time, back into S3, and round and round they go. It’s a pretty common pattern. You go to Uber, Netflix, they learned how to do this.”

He said they use Apache YARN to stand up Spark clusters. “They’re…standing up instances of data science, updating those algorithms, deploying them back into their applications, and it’s not just that they’re doing this, it’s how quickly they can do it, they can update their algorithms very fast.”
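
Hopkins did not detail Stitch Fix’s plumbing beyond S3, Spark and YARN, but the publish-and-reload cycle he describes can be sketched generically with boto3; the bucket, keys and file names below are invented for illustration:

import pickle
import boto3

s3 = boto3.client("s3")
BUCKET = "example-algo-artifacts"          # hypothetical bucket

def publish_model(model, version: str):
    """Data-science side: serialize the updated model and push it to S3."""
    path = f"/tmp/model-{version}.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    s3.upload_file(path, BUCKET, f"recommender/{version}/model.pkl")

def load_model(version: str):
    """Application side: pull the refreshed model and embed it in the business app."""
    path = f"/tmp/model-{version}.pkl"
    s3.download_file(BUCKET, f"recommender/{version}/model.pkl", path)
    with open(path, "rb") as f:
        return pickle.load(f)

Each pass through functions like these is one lap of the closed loop: the application’s interactions generate new data, the data retrains the model, and the refreshed model is published back for the application to pick up.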

The Stitch Fix CAO told Hopkins that the company’s analytics strategy has helped it understand “not only what their customers want to buy but how much to make and how much inventory to carry. So this way of working is not just a matter of engaging with the customer, it applies to the whole company front and back. They’ve dramatically reduced their inventory carrying costs as well as knowing their customers better than their competition. That’s the secret to success in today’s BI world.”

Tesla, the advanced car manufacturer, has built what Hopkins calls an “insights fabric server” that serves as an analytics platform and a gathering place for “all the different places that they keep data.”

“They say, ‘Look, there’s too much data to keep it all in Hadoop, we have to put it all over the place and bring into a platform.’” Inside the platform is a web-based UI that serves up insights on-demand to engineers that are “data-massaged with knowledge.” Using a set of data pipelines that feed the platform, data scientists and design engineers work together to update cars’ firmware.

“They’ve added horsepower, they’ve changed the elevation, they’ve changed the experience in real time, and what that does is create more data and goes then back through their pipelines into all their sources and that data then becomes available to their engineers to change the experience again.”

A final step: after implementing and testing insights in software, “you’ve got to courageously measure and share the results. And that’s not easy specifically because a lot of times those results will be ugly, and you won’t want to share them across the organization.” But at Uber, for example, “half the employees access their data warehouse every day. They share data,” Hopkins said. Sharing is critical to hypothesizing, learning and refining insights on an ongoing basis.

“This is what these companies do,” Hopkins said. “They go around this loop in hours and days, not weeks, months and years. That’s how they’re outpacing their competition.”

 

BSC Presents Plan to Energize Europe’s Big Data Efforts


Researchers from the Barcelona Supercomputing Center today presented the big data roadmap commissioned by the EU as part of the RETHINK big project, which is intended to identify technology goals, obstacles and actions for developing a more effective big data infrastructure and a more competitive position for Europe over the next ten years. Not surprisingly, the leading position of non-European hyperscalers was duly noted as a major roadblock.

Paul Carpenter, senior researcher at BSC and a member of the RETHINK big team, presented the project’s results at the European Big Data Congress being held at BSC this week. The report, like so many EU technology efforts in recent years, is tightly focused on industry: developing a stronger technology supplier base as well as promoting the use of advanced-scale technology by commercial end-users. A major objective is to bring academic expertise together with industry to move the needle.

To a significant extent, the key findings are a warning:

  • Europe is at a strong disadvantage with respect to hardware/software co-design. The European ecosystem is highly fragmented while media and internet giants such as Google, Amazon, Facebook, Twitter and Apple and others (also known as hyperscalers) are pursuing verticalization and designing their own infrastructures from the ground up. European companies that are not closely considering hardware and networking technologies as a means to cutting cost and offering better future services run the risk of falling further and further behind. Hyperscalers will continue to take risks and transform themselves because they are the “ecosystem,” moving everybody else in their trail.
  • Dominance of non-European companies in the server market complicates the possibility of new European entrants in the area of specialized architectures.
  • Intel is currently the gatekeeper for new data center architectures; moreover, Intel is spearheading the effort to increase integration into the CPU package, which can only exacerbate this problem.
The report notes pointedly that even today big data is a capricious and arbitrary term; no one has adequately defined it, perhaps because it’s a moving target. What’s nevertheless clear is that the avalanche of data is real, and managing and mining it will be increasingly important throughout society. What’s not clear, at least to most European companies according to the report, is how much to spend on advanced technology infrastructure to capture the opportunity.

Many European companies, according to the report, are skeptical of the ROI and remain “extremely price-sensitive” with regard to adopting advanced hardware. Couple that wariness over ROI with “the fact that there is no clean metric or benchmark for side-by-side comparisons for heterogeneous architectures,” and, as the report puts it, “the majority of the companies were not convinced that the investment in expensive hardware coupled with the person months required to make their products work with new hardware were worthwhile.”

One reason for the more laissez-faire attitude, the report suggested, is that the industry does not yet see big data problems, only big data opportunities: “This is largely the case because the industry is not yet mature enough for most companies to be trying to do that kind of analytics and all-encompassing Big Data processing that leads to undesirable bottlenecks.”

Industry is also still focused on figuring out how to extract value from its data, according to the roadmap, and companies are still looking for the right business model to turn this value into profit. Consequently, they are not focused on processing (and storage) bottlenecks, let alone on the underlying hardware.

The plan suggests a timetable for many objectives and identifies a “technology readiness level” (TRL) for each goal, as well as achievable elements; one example in the roadmap is built around networking technology.

Many of the report’s findings were gleaned from interviews and surveys with more than 100 companies across a broad spectrum of big data-related industries, including “major and up-and-coming players from telecommunications, hardware design and manufacturers as well as a strong representation from health, automotive, financial and analytics sectors.” A fair portion of the report also tackles the technology innovation needed to handle big data going forward.

The RETHINK big roadmap, said Carpenter, helps provide guidance on gaining leadership in big data through, “stimulation of research into new hardware architectures for application in artificial intelligence and machine learning, and encouraging hardware and software experts to work together for co-design of new technologies.”

The report tackles a long list of topics: disaggregation of the datacenter; the rise of heterogeneous computing, with an emphasis on FPGAs; use of software-defined approaches; high-speed networking and network appliances; and trends toward growing integration inside the compute node (System on a Chip versus System in a Package). Interestingly, quantum computing is deemed unready, but neuromorphic computing is thought to be on the cusp of readiness and represents an opportunity for Europe, which has supported active research through its Human Brain Project.

The RETHINK big roadmap bullets out 12 action items, which constitute a good summary:

  1. Promote adoption of current and upcoming networking standards. Europe should accelerate the adoption of the current and upcoming standards (10 and 40Gb Ethernet) based on low-power consumption components proposed by European companies and connect these companies to end users and data-center operators so that they can demonstrate their value compared to the bigger players.
  2. Prepare for the next generation of hardware and take advantage of the convergence of HPC and Big Data interests. In particular, Europe must take advantage of its strengths in HPC and embedded systems by encouraging dual-purpose products that bring these different communities together (e.g. HPC / Big Data hardware that can be differentiated in SW). This would allow new companies to sell to a bigger market and decrease the risk associated with development of new product.
  3. Anticipate the changes in Data Center design for 400Gb Ethernet networks (and beyond). This includes paying special attention to photonics-on-silicon integration and novel Data Center interconnect designs.
  4. Reduce risk and cost of using accelerators. Europe must lower the barrier to entry of heterogeneous systems and accelerators; collaborative projects should bring together end users, application providers and technology providers to demonstrate significant (10x) increase in throughput per node on real analytics applications.
  5. Encourage system co-design for new technologies. Europe must bring together end users, application providers, system integrators and technology providers to build balanced system architectures based on silicon-in-package integration of new technologies, I/O interfaces and memory interfaces, driven by the evolving needs of big data.
  6. Improve programmability of FPGAs. Europe should also fund research projects involving providers of tools, abstractions and high-level programming languages for FPGAs or other accelerators with the aim of demonstrating the effectiveness of this approach using real applications. Europe should also encourage a new entrant into the FPGA industry.
  7. Pioneer markets for neuromorphic computing and increase collaboration. For neuromorphic computing and other disruptive technologies, the principal issue is the lack of a market ecosystem, with insufficient appetite for risk and few European companies with the size and clout to invest in such a risky direction. Europe should encourage collaborative research projects that bring together actors across the whole chain: end users, application providers and technology providers to demonstrate real value from neuromorphic computing in real applications.
  8. Create a sustainable business environment including access to training data. Europe should address access to training data by encouraging the collection of open anonymized training data and encouraging the sharing of anonymized training data inside EC-funded projects. To address the lack of information sharing, Europe should encourage interaction between hardware providers and Big Data companies using the network-of-excellence instrument or similar.
  9. Establish standard benchmarks. It is difficult for industry to assess the benefits of using novel hardware. We propose establishing benchmarks to compare current and novel architectures using Big Data applications.
  10. Identify and build accelerated building blocks. We propose to identify often-required functional building blocks in existing processing frameworks and to replace these blocks with (partially) hardware-accelerated implementations.
  11. Investigate intelligent use of heterogeneous resources. With edge computing and cloud computing environments calling for heterogeneous hardware platforms, we propose the creation of dynamic scheduling and resource allocation strategies.
  12. Continue to ask the question – Do companies think that hardware and networking optimizations for Big Data can solve the majority of their problems? As more and more companies learn how to extract value from Big Data as well as determine which business models lead to profits, the number of service offerings and products based on Big Data analytics will grow sharply. This growth will likely lead to an increase in consumer expectations with respect to these Big Data-driven products and services, and we expect companies to run into more and more undesirable performance bottlenecks that will require optimized hardware.
Clearly Europe, just like the U.S., is mobilizing efforts to turn high performance technology into a competitive strength generally, and it is also turning its attention to big data. Increasingly, that means blending HPC capabilities with big data analytics to turn the growing gush of data into scientific insight and commercial advantage. The latest report, roughly 50 pages, is a quick but worthwhile read for understanding the EU's directional thinking.

Link to BSC press release: https://www.bsc.es/about-bsc/press/bsc-in-the-media/bsc-highlights-need-european-research-big-data-hardware-and

Link to the THINK big roadmap: http://www.rethinkbigproject.eu/sites/default/files/u273/D5.3RoadmapV23_0.pdf

How P&G Got Hooked on Analytics: ‘High Value Problems People Were Willing to Pay For’


Procter & Gamble, the $76 billion consumer packaged goods giant with sales to 5 billion consumers in 180 countries, has lots of data. Four years ago, as the company found itself waist-deep in a rising tide of opaque data, P&G embarked on a voyage of big data discovery, one that provides a series of technology and organizational object lessons for the host of quietly desperate companies (75 percent, according to Forrester Research) that worry they're not wrestling enough insight from their data.

Walking point into the analytics jungle for P&G is Terry McFadden, associate director and enterprise architect (Decision Support Systems), who has been at the company for 30 years. Known around P&G as a “part-time comedian,” McFadden has an avuncular, competent and upbeat bearing that, one can imagine, puts colleagues at ease and promotes camaraderie in a team embarking on a daunting task.

The initial impulse toward data analytics at P&G began around 2012 when new sources of data converged into today’s data onslaught.

“Our business teams needed more access, more data,” McFadden said recently in a presentation at Strata + Hadoop World in New York. “When you think about it, it’s sales, it’s market measurements, it’s weather, it’s social. They wanted more granularity. We needed to wrap our arms around more of it and really improve the time-to-decision. We wanted to get deeper into the explanatory analytics, answering the age-old question: ‘Why?’… We needed the ability to acquire and integrate many different types of data – structured, unstructured – we had that problem relative to the volume, the variety, the velocity, the variability of data.”

Traditional approaches for gaining insight from data just weren’t adequate, McFadden said. “The volumes were growing, the costs were growing, and the time-to-insight required in terms of the business event cycle was speeding up, and so this is a real challenge.”

As can happen when new technology takes the stage, the impetus to embark on a new strategy came from a senior P&G exec who sent an email to the IT group extolling the potential offered by the exciting new field of big data analytics. “I remember seeing an email from upper management and I about came out of my chair,” McFadden said. “It was one of those moments where – I’m teasing – somebody must have read something in Sky Magazine about Big Data, about what they ought to do and what products they ought to apply.”

Soon enough, a VP asked McFadden to come to his office. “We need to solve big data,” McFadden said to him. “Fine,” said the VP, and began issuing battle orders: “You will do the architecture, you will do the evaluation of technologies, you will recommend the technology, you will drive the PoC, you will make a final recommendation, you will work with our folks and identify the key problems we’re going to go after.”

And he gave McFadden six months.

Terry McFadden of P&G

Where to begin? McFadden took a page out of a Gartner Group playbook that the consulting group recently published: “Mastering Data and Analytics: 'Table Stakes' for Digital Business.” In the initial phase, the idea is to go big and small simultaneously: take on strategic challenges that senior people are struggling with, in order to demonstrate value, but proceed on a baby-step basis:

“Start small,” says Gartner, “and build momentum by alleviating operational pain and deriving business value.” Small steps are more likely to proceed, and each success generates more organizational buy-in.

It was decided to go after category management as the broad and ambitious purview for data analytics at P&G. The targeted beneficiaries were “embedded business analysts,” described by McFadden as advisers whom business executives rely on to “provide insight to daily problems, an ever increasing set of tough questions that are continually changing.” If ever there were a group at P&G that would value big data analytics, this was it.

McFadden said the embedded analysts were sent surveys asking them: “Imagine you had a magic 8-ball and you could ask it anything: what are the wicked problems you wish it would answer?” The answers helped guide the initial proof points for McFadden’s team.

“We were gated by our ability to sell services – and really, it’s sell solutions and solve problems – what are the high value problems that people were willing to pay for,” McFadden said, adding that one use case would be live “sense and respond” interactions for customer-facing managers involved in a new product launch. Another example: supporting better interfacing between manufacturing and retailers “to really create a win/win scenario. It’s a challenging issue, it’s absolutely a big data problem, that in our proof of concept there were about two dozen data sets to bring together that we had never brought together before.”

This required the ability to quickly load and integrate structured, unstructured and semi-structured data. Within two weeks, McFadden’s team was able to load more than 25 data sources, including market signals, item sales, market share, surveys, social media and demographics, along with traditional sources, all within P&G’s data warehouse.
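
The article does not name the tooling behind that two-week load, but pulling structured, semi-structured and unstructured sources into one landing zone typically looks something like this PySpark sketch, with all paths and formats chosen purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-format-landing").getOrCreate()

# Structured: item sales and market-share extracts delivered as CSV.
sales = spark.read.csv("hdfs:///landing/item_sales/*.csv", header=True, inferSchema=True)

# Semi-structured: social media and survey responses as JSON.
social = spark.read.json("hdfs:///landing/social/*.json")

# Unstructured: free-text consumer comments, kept as raw lines for later NLP.
comments = spark.read.text("hdfs:///landing/comments/")

# Land everything, untransformed, in the shared lake zone for the analysts to query.
sales.write.mode("append").parquet("hdfs:///lake/raw/item_sales/")
social.write.mode("append").parquet("hdfs:///lake/raw/social/")
comments.write.mode("append").parquet("hdfs:///lake/raw/comments/")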

In building the infrastructure, McFadden said, it was decided early on that a top architectural priority was to build an analytics ecosystem that did not take the data off-platform. Back then, in 2013, tools vendors assured customers that they did Big Data and that they had Hadoop, but an outboard platform was involved, “and the size of the outboard motor and attendant costs of that model was not attractive to us,” McFadden said.

After an extensive search, McFadden’s team found this capability at Arcadia Data, a visual analytics and BI platform for big data that enables users to create visualizations of Hadoop data.

“Yes, we have other tools associated with the stack,” McFadden said, “and we continue to look at and evaluate the marketplace, but we believe fundamentally in the model of moving the work to the data, and not moving the data off platform to the work. We think that’s a core, winning tenet of the Hadoop ecosystem. And when I say that, it’s everything that’s grown around…that ecosystem that morphs with different processing frameworks that is very attractive. Those that move the work to the data continue to show an ROI that allows you to build out more and more.”

McFadden characterized P&G’s analytics physical landscape as largely a Cloudera stack, “a typical picture, the creatures in the zoo,” an infrastructure based on an appliance approach that enables high-speed connectivity to its high-speed data warehouse ecosystem.

He said results came quickly.

“Literally, inside of two weeks we were up and running,” he said. “We had a SWAT team deployed to help us. I didn’t expect anybody to say: ‘I’m just going to get a big data container, put data in it, shake it like the Magic 8-ball and get answers.’ But inside of two weeks, with a lot of help from our partners, we were able to start banging on the data and get after some of these (high value) questions.”

McFadden and his team went from there, building analytics capabilities use case by use case.

“It wasn’t a big bang,” he said. “No one told us, ‘Here’s the budget money, now move it all over.’ This was: we’re going to focus on high value problems that people are willing to pay to solve. There’s a dragon in their business and they want it slain, and they haven’t been able to do that with traditional approaches.”

The result has been making believers out of analysts and managers who might otherwise regard data analytics as a fad, or who would chant “obsolete before plateau.”

“It’s sell a little, make a little, we hope learn a lot, grow and keep on investing,” McFadden said. “That’s been our approach.” And from a team management point of view, it’s “a lifetime of attention and study that’s going to be required here.”


Pure Storage Makes Splash With ‘Big Data Flash’


As data intensive storage workloads proliferate across a range of industries, storage vendors are attempting to upgrade and scale their platforms to speed the analysis of a torrent of unstructured data.

The latest attempt comes from all-flash storage array specialist Pure Storage (NYSE: PSTG), which last week released an updated version of its FlashBlade solid-state array. The company said blades with 8.8 TB and 52 TB capacities are generally available along with accompanying software.

The upgrade comes 10 months after Pure Storage, Mountain View, Calif., unveiled its FlashBlade platform. Since the platform's official release last July, the company claims it has made significant inroads in areas such as real-time and big data analytics, financial analysis and energy exploration. All of those segments are looking for new approaches for handling the proliferation of unstructured data while attempting to connect the dots to make sense of it all.

Those use cases are driving a new set of storage and other workload requirements that industry watchers assert have made traditional storage architectures obsolete. The result has been the emergence of "big data flash" platforms that began generating revenues for the first time in 2016.

"Big data flash platforms are optimized to handle very large unstructured data sets with high degrees of concurrency while delivering flash performance and reliability," said Eric Burgener, IDC's research director for storage. The market analyst estimates the emerging all-flash storage market will reach more than $1 billion in revenues by 2020.

Pure Storage, which launched an initial stock offering in October 2015, has since been positioning itself to leverage the transition from analyzing historical data in batch mode to real-time analytics driven by emerging tools such as Apache Spark.

Along with all-flash arrays, the real-time approach requires scalable file and object storage. Hence, Pure Storage CEO Scott Dietzen stressed in a statement that the company has expanded the FlashBlade platform to handle the "rapidly expanding world of unstructured data."

The company also cited a batch of emerging use cases for its all-flash storage architecture, including a big data genomics project at the University of California at Berkeley along with banking applications based on cloud and software-as-a-service approaches.

The UC-Berkeley project includes complex analysis and running data-intensive visualizations in three dimensions, the company noted, asserting that Spark queries which previously took 12 hours have been reduced to about 30 minutes.

Those use cases are based on early deployments of the FlashBlade platform designed to "harden" the platform across varied workloads, the company noted. Pure Storage also said it gained experience in running large-scale Apache Spark clusters for tasks such as machine learning and SQL query processing.

The storage vendor also said it has identified similarities among different workloads. "We’ve observed that analyzing genomes for clinical diagnostics is, from a workload perspective, very similar to the way geophysicists use clusters of computers to perform geophysical mapping in oil and gas," the company noted in a blog post.

"Simultaneously, we realized how similar this flow is to the way a data scientist uses Apache Spark for business analytics or to create scenarios for machine learning."

Cloud Data Manager Rubrik is Flush With Cash


Rubrik, the cloud data management startup and cash magnet, said this week it has completed its latest funding round, generating a hefty $180 million and raising its venture capital haul to $292 million in four funding rounds.

Founded in early 2014 by veteran engineers from Data Domain, Facebook (NASDAQ: FB), Google (NASDAQ: GOOGL) and VMware (NYSE: VMW), the startup offers application development tools that combine enterprise data management with web-scale IT.

The latest funding round was led by IVP (Institutional Venture Partners) along with earlier investors Greylock Partners and Lightspeed Venture Partners. IVP also was an early investor in API specialist Mulesoft (NYSE: MULE) and Snapchat (NYSE: SNAP), which both went public in March.

Rubrik said it would use its deep pockets to expand product development and global technology investments.

The Palo Alto, Calif., company claims a run-rate approaching $100 million in its first six quarters and a market valuation estimated at $1.3 billion. Fortune 500 customers come from the financial services, government, healthcare and retail sectors.

The startup appears to have caught the wave of hybrid and multi-cloud adoption as large enterprises look to leverage public cloud flexibility to help deliver distributed applications and other data management services.

Rubrik said it has launched eight products over the last two years that support more than a dozen applications across multiple cloud platforms. For example, it recently announced cloud native applications running on Amazon Web Services (NASDAQ: AMZN) and Microsoft Azure (NASDAQ: MSFT) along with data orchestration tools operating across multiple clouds.

"We have not even dipped into the $61 million from Series C since we’ve been judiciously fueling our hyper-growth with cash flow," boasted Bipul Sinha, Rubrik's co-founder and CEO, in a blog post. "This financing allows us to accelerate our innovation pace and build a global brand, all while upending a stagnant industry of legacy players."

The company argues there is a huge unmet need for a cloud native data management approach. Hence, Rubrik seeks to go beyond simplifying data backup and recovery to offer "instant data access for recovery, analytics, compliance and search," Sinha said.

Sinha was also a founding investor in enterprise virtualization and storage vendor Nutanix. Arvind Jain, a former Google distinguished engineer, serves as Rubrik's vice president of engineering.

Rubrik's data management approach is designed to allow enterprises to provision data for application development regardless of where apps reside. Hence, data can move securely between datacenters and the cloud.

The startup's surge also is fueled by the steady enterprise shift to cloud-based analytics. Rubrik and a band of data management upstarts anticipated the impact of database deployments along with the resulting data explosion. That is forcing companies to reconsider data management schemes as they increasingly seek to unlock silos of enterprise application data for business analytics.

 

Data Dilemma: What to Store, What to Dump?


As storage costs decline on a Moore's Law curve, technology vendors seeking to drive storage capacity and enterprises leveraging that stored data for new applications agree that we have plenty of data available but remain woefully short on wisdom.

As more connected devices fuel the data explosion, a growing problem for storage and analytics providers is figuring out how to prioritize data as millions and eventually billions of devices are connected. "Too much data but not enough knowledge," is the way Steve Luczo, CEO of storage specialist Seagate Technology (NASDAQ: STX) described the dilemma during a company-sponsored panel discussion on Monday (June 5).

Seagate, Cupertino, Calif., also released the results of a report by market analyst IDC that forecasts data creation will soar to 163 zettabytes by 2025.

"From a storage perspective, we have to get ready for [zettabytes] in terms of available data" and deliver platforms "so the priority data can be stored," Luczo added. For users, it often comes down to, "What can you afford to store?"

One measure of the data flood is the number of embedded devices per person feeding into datacenters. IDC reckons that number will jump from less than one to more than four per "connected" person by 2025.

Seagate CEO Steve Luczo

Even as consumers generate more data, that trend is expected to shift over the next decade as 60 percent of all data is generated by enterprises, according to the IDC study. Meanwhile, the number of use cases is growing, Luczo noted. "If we don't continue to drive down the cost of technology, then these use cases aren't going to develop."

Among those uses cases are consumer applications ranging from autonomous vehicles and other "edge" data platforms to entertainment and personalized medicine. For these applications, developers are attempting to bake in real-time data collection capabilities. "Even though you're not going to process all your data in real time, you have to collect it in real time because you may want to react in close to real time to what consumers are doing," noted Miguel Alvarado, vice president of data and analytics at Vevo, an online music video service.

With all that personal and unstructured data flying around, technology vendors and users alike are struggling to figure ways of organizing it to solve problems while making a few bucks in the process. "Really what we want to do is be predictive," Luczo observed.

As enterprises generate far more data over the next decade, other obligations arise as cloud vendors strive to protect data while extracting value from it. Unlike consumer markets, enterprise data "tends to be more siloed" due to security, privacy and other compliance requirements, noted Kushagra Vaid, general manager of Microsoft's Azure infrastructure unit.

Hence, the steady analytics advances in the consumer sector ranging from sharing and learning from data may not apply on the same scale to the enterprise sector, Vaid stressed. "How do we keep parity between the advancements…in the consumer space and the enterprise space?" he asked.

Meanwhile, storage vendors like Seagate are working to increase areal density to pack more storage capacity into smaller devices. Densities are increasing at a 30-40 percent compound annual rate. At the same time, Vaid noted, cloud demand is rising at about 130 percent, creating a divergence: "The demand at which the cloud is growing is four times bigger than the rate at which the underlying storage technology density is growing."
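
Taking the midpoint of that 30-40 percent density range, the "four times" figure is straightforward arithmetic:

\[ \frac{\text{cloud demand growth}}{\text{areal density growth}} \;\approx\; \frac{130\%}{35\%} \;\approx\; 3.7 \;\approx\; 4\times \]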

Hence, he added, breakthrough storage technologies will be needed to increase areal density as scaling slows for existing approaches.

Data proliferation, especially "transient data" used mostly for real-time analytics, also raises questions about whether big data is approaching a point of diminishing returns. "We are data rich and knowledge poor," agreed Vevo's Alvarado, who called for industry standards on data collection.

"Nobody cares about video surveillance of an empty parking lot," stressed IDC analyst Dave Reinsel. Advocating a new form of data compression, Reinsel added, "Sure, we can collect a lot of data that's overwhelming but…we'll figure out how much data we'll actually need to make decisions.

"There's a learning curve that were going to struggle through," the analyst concluded. "We will make it manageable at the end of the day."

Splunk Beefs Up Cloud Monitoring Tool


As enterprises accelerate the shift to a hybrid private/public cloud model, a growing list of data analytics vendors are stepping up to offer cloud monitoring and other tools designed to ease cloud migrations that are expected to consume the lion's share of IT budgets over the next year and a half.

Among them is Splunk Inc., which this week rolled out a cloud-monitoring tool specifically geared to tracking workloads running on Amazon Web Services (NASDAQ: AMZN) public clouds. The analytics tool is designed to increase visibility into AWS infrastructure to track cloud usage and fine tune cloud infrastructure while gauging workload performance and application availability.

Cloud users "need the ability to correlate data sources across environments, in real time to derive maximum value," Rick Fitz, Splunk's senior vice president of IT markets, noted in a statement.

San Francisco-based Splunk (NASDAQ: SPLK), is among a growing number of cloud analytics vendors providing application and workload monitoring tools as more companies embrace hybrid cloud deployments and shift sensitive workloads to AWS and other public clouds.

Other entrants in the cloud-monitoring arena include software asset management specialists such as Flexera Software, which launched a public cloud-monitoring tool last year, and newcomers such as Wavefront, which is developing tools that help software-as-a-service vendors improve their DevOps functions. As of last fall, the cloud monitoring startup had raised $52 million in venture capital.

Meanwhile, Flexera's monitoring tool initially targeted AWS with plans to extend the services to Google Cloud (NASDAQ: GOOGL) and Microsoft Azure (NASDAQ: MSFT) and other public cloud platforms.

Splunk said Monday (Aug. 14) its monitoring tool could be used to correlate AWS cloud data with other sources and across hybrid platforms using visualizations and pre-built dashboards.

This and other cloud analytics tools respond to a steady enterprise shift to public clouds, including "a potential consolidation of cloud providers or solutions," according to a recent cloud security report released by Intel's McAfee cyber security unit.

The study found that the average number of cloud services used by respondents to its survey declined from 43 in 2015 to 29 last year. "Cloud architectures also changed significantly, from predominantly private-only in 2015 to increased adoption of public cloud resulting in a predominantly hybrid private/public infrastructure in 2016," McAfee reported.

Those trends are attracting data analytics vendors such as Splunk that specialize in parsing machine data to provide intelligence about IT operations.

Splunk said its Insights for AWS Cloud Monitoring tool runs with several AWS tools, and is available only under license on an annual basis with pricing starting at $7,500 per year. A 30-day free trial is available on the AWS Marketplace.

As Object Storage Booms, Analytics Issues Emerge


With object-based storage capacity predicted to grow at double-digit annual rates over the next several years, attention is turning to shortcomings such as barriers to data visibility and difficulties in performing analytics using object storage.

A new vendor survey finds that “object storage has gone mainstream,” with 72 percent of respondents using Amazon Web Services’ Simple Storage Service (S3). As AWS object storage explodes, use cases are shifting toward analytics and data lakes.

Nevertheless, Chaos Sumo, an object storage startup focused on cloud-based analytics and log data retention services, found that the transition to object storage creates another set of problems, including inconsistent predictive analytics. Overall, the vendor found that analytics and visibility barriers are critical concerns for those planning to use object storage as a platform for business analytics.

While the survey found that the majority of S3 customers use it as “a cheap alternative to on-premises storage” for backup and archiving data, object storage is also widely used for application and media hosting, and 32 percent of respondents said they use it for business analytics.

Greater adoption of object storage for data lakes and expanding use cases have uncovered shortfalls such as limited visibility into stored data, inconsistent analytics performance and the growing cost of moving large data volumes. Just over one-quarter of respondents said moving data in order to analyze it was their biggest challenge in managing S3 object storage.
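One common way to avoid moving the data at all is to query it in place. The sketch below is a minimal illustration using Amazon Athena against S3-resident data; the database, table and result bucket names are hypothetical.

```python
# Minimal sketch of querying S3 data in place with Amazon Athena,
# sidestepping the "move it before you can analyze it" problem.
# Database, table and bucket names are hypothetical.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="""
        SELECT status, COUNT(*) AS requests
        FROM access_logs
        WHERE year = '2018' AND month = '01'
        GROUP BY status
    """,
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Poll until the query finishes, then fetch the result set.
execution_id = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(
        QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    # The first row returned by Athena is the column header.
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```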

The “increasing costs of storing data for real- or near-time analysis is the core impediment to doing more with the growing amount of data stored in object storage,” said Thomas Hazel, founder and CTO of Chaos Sumo, Somerville, Mass.

Hence, the savings provided by object storage platforms like S3 may be offset by rising computing and networking costs, according to Chaos Sumo’s survey of more than 120 data scientists, analytics and DevOps/IT managers released this week.
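A rough back-of-the-envelope calculation shows why. The figures below are rounded, assumed list prices (they vary by region, tier and volume), not numbers from the survey, but they illustrate how egress charges for a dataset that is repeatedly pulled out for analysis can quickly dwarf the cost of storing it.

```python
# Illustrative cost arithmetic with assumed, rounded US-East list prices.
# Actual rates vary by region, storage tier and volume.
STORAGE_PER_GB_MONTH = 0.023   # S3 Standard storage, per GB-month
EGRESS_PER_GB = 0.09           # data transfer out to the internet, per GB

dataset_gb = 1_000             # a 1 TB analytics dataset
pulls_per_month = 4            # copied out weekly for external analysis

storage_cost = dataset_gb * STORAGE_PER_GB_MONTH
egress_cost = dataset_gb * pulls_per_month * EGRESS_PER_GB

print(f"Monthly storage: ${storage_cost:,.2f}")   # ~$23
print(f"Monthly egress:  ${egress_cost:,.2f}")    # ~$360
```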

While about half of those surveyed said they are using the Amazon Redshift data warehouse along with S3, 42 percent said they are using home-grown tools to overcome data analytics and visibility issues. “These tools are not only inadequate at addressing the jobs needed to be done, they also take a lot of time to set up and manage,” the survey found. For example, 52 percent of respondents said it took them more than three months to build their current analytics architecture.
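For teams already running Redshift alongside S3, one alternative to home-grown data-movement tooling is Redshift Spectrum, which lets the cluster scan S3 objects through an external schema. The sketch below is a minimal, assumed example; the cluster endpoint, credentials and the spectrum_logs schema are hypothetical, and psycopg2 is just one of several client libraries that could issue the query.

```python
# Assumed sketch: query S3-resident data from Redshift via a
# Redshift Spectrum external schema instead of copying it out first.
# Endpoint, credentials and the spectrum_logs schema are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="placeholder",  # placeholder credential
)

with conn, conn.cursor() as cur:
    # spectrum_logs is an external schema mapped to the S3 data lake;
    # the scan runs in Spectrum, so the raw objects never leave S3.
    cur.execute("""
        SELECT event_date, COUNT(*) AS events
        FROM spectrum_logs.clickstream
        WHERE event_date >= '2018-01-01'
        GROUP BY event_date
        ORDER BY event_date
    """)
    for event_date, events in cur.fetchall():
        print(event_date, events)
```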

The survey, conducted between December 2017 and January 2018, also found that one-third of those polled are using object storage to streamline their data lakes for applications like machine learning and historical trend analysis.

The shift to object storage also has become the latest front in the ongoing public cloud price wars as AWS (NASDAQ: AMZN) seeks to maintain its sizeable market share lead. Meanwhile, chief competitors Microsoft (NASDAQ: MSFT), Google (NASDAQ: GOOGL) and IBM (NYSE: IBM) look to differentiate their services. For example, market tracker 451 Research found last year that leading public cloud vendors were cutting their object storage prices in order to compete with AWS.

Last year, Boston-based Wasabi launched AWS-compatible object storage technology that it says costs 80 percent less than S3 while performing at 6X the speed. Last month, Wasabi took on the data movement issue by announcing a pricing plan with unlimited free egress, eliminating all charges except the basic charge for actual storage, according to the company.

"Egress charges have been one of the biggest inhibitors to enterprises moving their data to the cloud," the company said in a prepared announcement. "Customers dislike egress fees because they make it impossible to accurately predict how much they are going to spend, and inevitably creates vendor lock-in... Wasabi’s vision is to make cloud storage a simple one-size-fits-all utility, like electricity or bandwidth, and that billing should be as simple and transparent as possible."

With cloud-based data analytics applications growing along with adoption of data lakes, market analysts and vendors such as VMware (NYSE: VMW) note that databases have emerged as a top cloud workload. Relational databases are expected to be the “next competitive front” in the ongoing public cloud price wars.

Hence, tools for real-time analytics beyond Redshift and Amazon Athena will be needed, a market that startups such as Chaos Sumo are now targeting.
