Traditional machine learning and data analysis tools include SAS, IBM SPSS, Weka, and the R language. They can perform in-depth analysis on small datasets, that is, datasets small enough to fit in the memory of the node running the tool.
The second generation of machine learning tools includes Mahout, Pentaho, and RapidMiner. They can perform what I call shallow analysis of big data.
Attempts to scale traditional machine learning tools over Hadoop, including Revolution Analytics' RHadoop and SAS on Hadoop, can be grouped with the second generation tools.
Third generation tools such as Spark, Twister, HaLoop, Hama, and GraphLab can perform in-depth analysis of big data. Several recent efforts by traditional vendors, including SAS in-memory analytics, also fall into this category.
First generation machine learning tools / paradigms
Because the first generation tools implement a large number of machine learning algorithms, they are well suited to deep analysis.
However, due to scalability limitations, they cannot handle large datasets at the terabyte or petabyte scale, since these tools are not distributed by nature.
In other words, they can scale vertically (you can increase the processing power of the node the tool runs on), but they cannot scale horizontally (not all of them can run on a cluster).
First generation tool vendors are addressing these limitations by building Hadoop connectors and providing clustering options, which means they are hard at work redesigning tools like R or SAS to scale horizontally.
Second generation machine learning tools / paradigms
Second generation tools (we can now call traditional machine learning tools like SAS first generation tools) such as Mahout (http://mahout.apache.org), RapidMiner, and Pentaho implement their algorithms on top of open source MapReduce (Hadoop), which lets them scale to large datasets.
These tools are still improving rapidly and are open source (especially Mahout). Mahout has a series of clustering and classification algorithms as well as a very good recommendation algorithm (Konstan and Riedl, 2012).
Hence, it can handle big data and there are already a large number of use cases in a production environment, mainly for recommendation systems.
An assessment of Mahout finds that it implements only a small subset of machine learning algorithms: about 25 production-quality algorithms in total, of which only 8 or 9 run on Hadoop and can therefore scale to large datasets.
These algorithms include linear regression, linear support vector machines, and the K-means clustering algorithm, among others. It also provides a fast sequential implementation of logistic regression.
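As a sketch of one of these algorithms, here is a minimal NumPy implementation of K-means (Lloyd's algorithm). It is illustrative only, not Mahout code; the comments note which step a MapReduce version would treat as the map and which as the reduce. The dataset and the deterministic initialization are my own choices for the example.

```python
import numpy as np

def kmeans(points, centers, iters=10):
    """Plain Lloyd's algorithm, the flavour of K-means that
    MapReduce-based tools parallelize: the assignment step is a
    map, the center-update step is a reduce."""
    centers = centers.copy()
    for _ in range(iters):
        # Assign each point to its nearest center (the "map" step).
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as its cluster mean (the "reduce" step).
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs; the initial centers are one point from
# each blob (a real implementation would use random or k-means++
# seeding rather than this hand-picked initialization).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
                 rng.normal(5.0, 0.1, (10, 2))])
centers, labels = kmeans(pts, pts[[0, 10]])
```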
However, as others (for example, on Quora.com) have pointed out, it does not implement nonlinear support vector machines or multinomial logistic regression (also known as the discrete choice model).
That said, I think some machine learning algorithms are genuinely difficult to implement on Hadoop, such as kernel SVMs and conjugate gradient descent (CGD; it is worth noting that Mahout does implement stochastic gradient descent).
Others have pointed out this point, for example you can see the article by Professor Srirama (Srirama et al., 2012).
A detailed comparison of Hadoop and Twister MR (Ekanayake et al., 2010) on iterative algorithms such as conjugate gradient shows that Hadoop's overhead is significant.
What do I mean by iterative? A group of entities performing a computation must wait for results from neighbors or other entities before moving on to the next iteration. CGD is a good example of an iterative algorithm: each iteration can be decomposed into three primitives.
The three primitives are as follows: the daxpy operation multiplies a vector x by a constant k and adds the result to another vector y; ddot computes the dot product of two vectors x and y; matmul multiplies a matrix by a vector and returns another vector.
Each of these operations corresponds to a MapReduce operation, so one iteration involves six MR operations.
In the end, a single CG computation can require on the order of 100 MR operations and gigabytes of data exchange, even though the matrix involved is small.
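To make this concrete, here is a plain NumPy sketch (not Mahout or Hadoop code) of conjugate gradient built from exactly these three primitives. The comments count the primitive calls in each loop pass, which is where the six MR operations per iteration come from.

```python
import numpy as np

# The three primitives each CG iteration decomposes into:
def daxpy(k, x, y):
    """Multiply vector x by constant k and add it to vector y."""
    return k * x + y

def ddot(x, y):
    """Dot product of vectors x and y."""
    return float(np.dot(x, y))

def matmul(A, x):
    """Multiply matrix A by vector x, returning a vector."""
    return A @ x

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b (A symmetric positive-definite) using only
    the daxpy / ddot / matmul primitives."""
    x = np.zeros_like(b)
    r = daxpy(-1.0, matmul(A, x), b)      # r = b - A x
    p = r.copy()
    rs_old = ddot(r, r)
    for _ in range(max_iter):
        Ap = matmul(A, p)                 # 1 matmul
        alpha = rs_old / ddot(p, Ap)      # 1 ddot
        x = daxpy(alpha, p, x)            # 1 daxpy
        r = daxpy(-alpha, Ap, r)          # 1 daxpy
        rs_new = ddot(r, r)               # 1 ddot
        if rs_new < tol:
            break
        p = daxpy(rs_new / rs_old, p, r)  # 1 daxpy  -> 6 ops/iteration
        rs_old = rs_new
    return x
```

On Hadoop, each of those six primitive calls would become a MapReduce job per iteration, including the per-job startup and HDFS I/O overhead discussed next.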
In fact, the cost of setting up each iteration (including loading data from HDFS into memory) exceeds the cost of the iteration's computation itself, which degrades MR performance on Hadoop.
In contrast, Twister distinguishes between static and variable data, so that static data can remain in memory across MR iterations.
Some second generation tools are extensions of traditional tools onto Hadoop. One such alternative is Revolution Analytics' product line, which extends the R language over Hadoop by implementing a scalable R runtime on Hadoop (Venkataraman et al., 2012).
SAS in-memory analytics, part of the SAS High-Performance Analytics toolkit, is another attempt at scaling a traditional tool over Hadoop clusters.
However, since the recently released version runs not only over Hadoop but also supports Greenplum/Teradata, it should rather be considered a third generation machine learning approach.
Another interesting effort comes from a start-up called Concurrent Systems, which provides a Predictive Model Markup Language (PMML) runtime on Hadoop.
PMML is an XML-based format that allows a model to be stored in a descriptive language file; traditional tools such as R and SAS can save models as PMML files.
The runtime on Hadoop allows these model files to be executed over a Hadoop cluster, so such products are second generation tools/paradigms as well.
Third generation machine learning tools / paradigms
The limitations of Hadoop itself, and its unsuitability for certain classes of applications, have prompted researchers to propose new alternatives.
Third generation tools mostly try to move beyond Hadoop along different dimensions. I will discuss implementations along three dimensions: machine learning algorithms, real-time analytics, and graph processing.
Iterative machine learning algorithms
Researchers at the University of California, Berkeley proposed an alternative: Spark (Zaharia et al., 2010). In the big data field, Spark is regarded as a next generation data processing solution that could replace Hadoop.
The core idea of Spark, which differs from Hadoop, is in-memory computation, which allows data to be cached in memory between different iterations and interactions.
The main motivation for developing Spark was that the common MR approach is only suitable for applications that can be expressed as acyclic data flows, and not for other programs, such as those that need to reuse a working set of data across iterations.
Therefore, they proposed a new approach to cluster computing that provides guarantees of scalability and fault tolerance similar to MR, while also supporting the efficient reuse of working sets across iterations.
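To make the working-set idea concrete, here is the canonical iterative workload from the Spark paper, logistic regression trained by batch gradient descent, sketched in plain NumPy. The toy dataset, learning rate, and iteration count are my own illustrative choices. On Spark, the dataset would be an RDD kept in memory with `cache()` and reused by every iteration, whereas Hadoop MR would reload it from HDFS on each pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: on Spark this would be an RDD loaded from HDFS once,
# cached in memory, and reused by every iteration below.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
lr = 0.5
for _ in range(100):
    # Every iteration re-reads the SAME working set (X, y).
    # Hadoop MR would reload it from HDFS each time; Spark does not.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    grad = X.T @ (p - y) / len(y)        # logistic-loss gradient
    w -= lr * grad

acc = float(np.mean(((X @ w) > 0) == y.astype(bool)))
```

The point of the example is not the model but the access pattern: the inner loop touches the full dataset a hundred times, which is exactly the case where in-memory caching beats per-iteration HDFS reloads.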
Researchers at Berkeley have proposed a complete technology stack called BDAS (the Berkeley Data Analytics Stack) for running data analysis tasks across the nodes of a cluster.
The bottom-most component in BDAS is called Mesos, which is a cluster manager that performs task allocation and resource management for cluster tasks. The second component is the Tachyon filesystem built on Mesos.
Tachyon provides a distributed file system abstraction and an interface for file operations between clusters.
In the actual implementation, Spark as a computing tool is implemented on top of Tachyon and Mesos, although it can be implemented without Tachyon or even Mesos.
Shark, which runs on Spark, provides a cluster-level structured query language abstraction similar to what Hive provides over Hadoop (Zaharia et al.).
HaLoop (Bu et al., 2010) also extends Hadoop to support machine learning algorithms: it provides a programming abstraction for expressing iterative applications and uses caching to share data between iterations and to check for a fixed point, thereby improving efficiency. Twister (http://iterativemapreduce.org) is a product similar to HaLoop.
Real-time analytics is the second dimension that goes beyond Hadoop's scope. Twitter's Storm is the strongest contender in this area.
Storm is a scalable complex event processing engine that enables complex real-time operations on streams of events. The components of a Storm cluster include:
Spouts, which read data from different data sources. There are HDFS spouts, Kafka spouts, and TCP stream spouts.
Bolts, which process the data. They operate on streams; this is where streaming machine learning algorithms typically run.
Topologies. A topology is the application-specific wiring of spouts and bolts, and it runs across the cluster nodes.
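To make the three components concrete, here is a hypothetical in-process imitation of the spout/bolt/topology pattern in Python. The class and function names are my own illustration, not the Storm API (real Storm topologies are typically written in Java against `TopologyBuilder`); the sketch only shows how the pieces relate.

```python
# A toy, in-process imitation of Storm's spout/bolt/topology pattern
# (NOT the Storm API): a spout emits tuples, a bolt transforms them,
# and a "topology" function wires the pieces together.

class Spout:
    """Reads from a data source; here, just a list of events."""
    def __init__(self, events):
        self.events = events

    def emit(self):
        yield from self.events

class WordCountBolt:
    """Processes the stream; here, counting word occurrences."""
    def __init__(self):
        self.counts = {}

    def execute(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

def run_topology(spout, bolt):
    """Wire the spout to the bolt and drain the stream."""
    for tup in spout.emit():
        bolt.execute(tup)
    return bolt.counts

counts = run_topology(Spout(["a", "b", "a"]), WordCountBolt())
```

In a real cluster the spout and bolt instances run on different nodes and the topology definition tells Storm how tuples flow between them; here everything runs in one process purely for illustration.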
In practice, an architecture that pairs a Kafka cluster (a distributed queuing system originally from LinkedIn) as a high-speed data ingester with a Storm cluster for processing and analysis performs very well.
A Kafka spout is used to read data quickly from the Kafka cluster, which buffers events in queues; this buffering is necessary because the Storm cluster may be busy applying machine learning to the stream.
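The buffering role Kafka plays here can be imitated in-process with a bounded queue. This toy producer/consumer sketch uses only the Python standard library (no Kafka or Storm involved) to show why a queue between a fast event source and a busy consumer keeps events from being dropped.

```python
import queue
import threading

# The bounded queue plays the role of the Kafka buffer between a
# fast producer (the event source) and a slower consumer (the
# Storm-like analysis stage).
buf = queue.Queue(maxsize=100)
processed = []

def producer():
    for i in range(50):        # fast event source
        buf.put(i)             # blocks if the buffer is full
    buf.put(None)              # sentinel: end of stream

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        processed.append(item * 2)  # stand-in for per-event analysis

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the queue is first-in first-out and bounded, the producer is throttled rather than overrunning the consumer, and every event is processed in order.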