Nippon Telegraph & Telephone Corporation (NTT; Chiyoda Ward, Tokyo; CEO: Satoshi Miura) and Preferred Infrastructure Corporation (PFI; Bunkyo Ward, Tokyo; CEO: Toru Nishikawa) have developed “Jubatus” (1st Edition)*2, an infrastructure technology capable of high-speed, real-time analysis of large-scale data, referred to as “Big Data”*1.
Conventional batch processing methods process data periodically in batches and put newly arrived data on hold until the next batch execution. These methods are inadequate for Big Data applications such as real-time trend analysis, for which the timeliness of data is a critical requirement.
By providing the capability to analyze the latest data in real time, Jubatus can help create value-added services in a wide range of areas, such as fraud detection, forecasting of market and economic trends and stock prices, natural disaster prediction, parts and materials procurement estimation for manufacturing, health-risk assessment, and predictive techniques in the life and natural sciences.
This development is a result of open innovation between NTT Information Sharing Platform Laboratories and PFI Corporation. It will be released as open source on October 27 on the Jubatus OSS Web site, http://jubat.us/, as public domain software that contributes to the utilization of Big Data.
(1) MIX Processing system
This processing system has the following three functions.
① MIX Computation: Arranges the aggregate computation logic, depending on the data analysis logic.
② MIX Protocol control: Determines how data is aggregated and redistributed when checking intermediate analysis results among the servers.
③ Membership management: Performs tasks such as recovery from server faults and adding more servers in order to ensure continuous data processing, before data overflow can occur.
Even with simultaneous parallel analysis, having all servers wait for each other to compare intermediate results at every iteration creates a bottleneck. To avoid this, servers exchange and mix intermediate results with other servers at suitable time intervals rather than at every iteration, so that each server can run autonomously without slowing down. The precision and strictness (overall consistency) of the aggregate results are relaxed within the range the application allows, and this is how the balance between real-time processing and scalability is adjusted (Figure 3).
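The idea of mixing at intervals can be illustrated with a minimal sketch. It is not the Jubatus implementation: the class LocalLearner, the functions mix and run, the simple linear model, and the choice of plain weight averaging as the aggregation step are all assumptions made for illustration. Each simulated server updates its own model from its own data stream, and the models are averaged and redistributed only every mix_interval steps.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Dict[str, float], float]  # (features, target)


class LocalLearner:
    """One simulated server holding its own copy of a linear model (hypothetical)."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}
        self.updates_since_mix = 0

    def update(self, features: Dict[str, float], target: float, lr: float = 0.1) -> None:
        # One online gradient step on squared error, using local data only.
        pred = sum(self.weights.get(k, 0.0) * v for k, v in features.items())
        err = target - pred
        for k, v in features.items():
            self.weights[k] = self.weights.get(k, 0.0) + lr * err * v
        self.updates_since_mix += 1


def mix(learners: List[LocalLearner]) -> None:
    """Average the weights of all learners and hand the result back to each one.

    Plain averaging is an assumption here; the text only says that
    intermediate results are aggregated and redistributed.
    """
    keys = set().union(*(l.weights.keys() for l in learners))
    averaged = {
        k: sum(l.weights.get(k, 0.0) for l in learners) / len(learners) for k in keys
    }
    for l in learners:
        l.weights = dict(averaged)
        l.updates_since_mix = 0


def run(streams: Sequence[Sequence[Example]], mix_interval: int) -> List[LocalLearner]:
    """Train one learner per stream, mixing only every `mix_interval` steps."""
    learners = [LocalLearner() for _ in streams]
    for step, batch in enumerate(zip(*streams), start=1):
        for learner, (features, target) in zip(learners, batch):
            learner.update(features, target)  # no waiting on other servers
        if step % mix_interval == 0:
            mix(learners)  # occasional synchronization point
    return learners
```

A larger mix_interval means less coordination between servers but models that drift further apart between mixes, which is the trade-off between scalability and consistency described above.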
(2) Pluggable architecture
Because analysis engines, analysis modules, and data storage methods (local or distributed) are defined against shared interfaces, they can be plugged in and out, combined, and rearranged flexibly.
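As a rough sketch of what a shared interface makes possible, the example below defines a storage interface and two interchangeable implementations. All names (Storage, LocalStorage, ShardedStorage, CountingEngine) are hypothetical and are not taken from Jubatus; the sharded class only imitates distributed storage inside a single process.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Dict, List


class Storage(ABC):
    """Shared interface that every storage plug-in implements (hypothetical)."""

    @abstractmethod
    def get(self, key: str) -> float: ...

    @abstractmethod
    def put(self, key: str, value: float) -> None: ...


class LocalStorage(Storage):
    """Keeps all values in one in-memory dictionary."""

    def __init__(self) -> None:
        self._data: Dict[str, float] = {}

    def get(self, key: str) -> float:
        return self._data.get(key, 0.0)

    def put(self, key: str, value: float) -> None:
        self._data[key] = value


class ShardedStorage(Storage):
    """Stand-in for distributed storage: spreads keys over several shards."""

    def __init__(self, shard_count: int) -> None:
        self._shards: List[LocalStorage] = [LocalStorage() for _ in range(shard_count)]

    def _shard_for(self, key: str) -> LocalStorage:
        # crc32 gives a stable hash, so a key always maps to the same shard.
        return self._shards[zlib.crc32(key.encode("utf-8")) % len(self._shards)]

    def get(self, key: str) -> float:
        return self._shard_for(key).get(key)

    def put(self, key: str, value: float) -> None:
        self._shard_for(key).put(key, value)


class CountingEngine:
    """Toy analysis engine that depends only on the Storage interface."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def observe(self, label: str) -> None:
        self._storage.put(label, self._storage.get(label) + 1.0)

    def count(self, label: str) -> float:
        return self._storage.get(label)
```

Because CountingEngine only calls get and put, swapping LocalStorage for ShardedStorage requires no change to the engine itself.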
(3) Workflow definition
Execution paths and parallel execution between process components, from data input through applied analysis to the analysis engine and beyond, can be defined and controlled easily and flexibly. As the first analysis engine for Jubatus, we have implemented and evaluated a multi-value classifier for online machine learning.
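To show what a multi-value classifier for online learning involves, here is a minimal sketch of a multiclass perceptron that updates its model one example at a time. The choice of the perceptron, the class name MulticlassPerceptron, and its methods are assumptions for illustration; the text does not state which learning algorithm the Jubatus classifier uses.

```python
from typing import Dict, Optional


class MulticlassPerceptron:
    """Online multiclass classifier: one weight vector per label (illustrative)."""

    def __init__(self) -> None:
        self.weights: Dict[str, Dict[str, float]] = {}

    def _score(self, label: str, features: Dict[str, float]) -> float:
        w = self.weights.get(label, {})
        return sum(w.get(k, 0.0) * v for k, v in features.items())

    def classify(self, features: Dict[str, float]) -> Optional[str]:
        """Return the highest-scoring known label, or None before any training."""
        if not self.weights:
            return None
        # Sorting makes tie-breaking deterministic (alphabetical order).
        return max(sorted(self.weights), key=lambda label: self._score(label, features))

    def train(self, features: Dict[str, float], label: str) -> None:
        """Update the model from a single labeled example as it arrives."""
        predicted = self.classify(features)
        self.weights.setdefault(label, {})
        if predicted == label:
            return  # correct prediction: perceptron makes no change
        for k, v in features.items():
            # Pull the true label's weights toward the example...
            self.weights[label][k] = self.weights[label].get(k, 0.0) + v
            # ...and push the wrongly predicted label's weights away.
            if predicted is not None:
                self.weights[predicted][k] = self.weights[predicted].get(k, 0.0) - v
```

Each call to train touches only the current example, so no stored batch of past data is needed, which is what makes this style of learning suitable for real-time streams.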
In order to further advance R&D and contribute to the development of information processing technology for Big Data, NTT and PFI Corp. are working to promote the spread of real-time large-scale data analysis infrastructure and related business by expanding the Jubatus community and businesses built on it.
In particular, we are considering an “SNS analysis application” service. This application will perform sophisticated analysis, such as categorization, fuzzy search, real-time filtering, and relevancy ranking, on the large volumes of real-time SNS data generated every day, so that the results can be used for marketing and other purposes. Figure 4 illustrates the concept of SNS analysis applications using Jubatus. Other applications include “Sensor data analysis,” “POS data analysis,” “Log data analysis,” “Financial data analysis,” and “Behavioral analysis.”