Apache Mahout

Apache Mahout is a library of code you might want to use in an analysis or search application. The complexity is enough to make you wait for a finished application using the library instead of writing your own. Unfortunately finished applications are rarely finished when first released and are oversold on capability. You need to understand the benefit and the limitations of the underlying technology before committing to use the results from any application.

Beta software

Apache Mahout is so early in the software development cycle that it is barely beta software and you use it at your own risk. There are some functions with significant testing and some that are brand new. Every release will make some functions mature while others will be fresh out of the oven. In fact some will be fresh out of the mixing bowl before going in the oven. You have to look at the testing of every function before use.

If you are working on a Mahout based project, you will know the history of each item in the Mahout library. I tried to work out the status and testing of some individual functions and it was all too hard. The best I could find was a division between core and test.

I started looking at Mahout for use in a content management system (CMS) because the CMS has an interface module. The interface module does not document what Mahout does. Mahout provides documentation then warns that some functions are new with little testing, so I tried to find out more. I did not find anything obvious and, for my purpose, would have to treat everything as untested.

Fuzzy logic

Fuzzy logic was a fashionable term a few years ago. Now people hate fuzzy logic because it did not do what software salespeople promised and sometimes it produced errors. Apache Mahout offers a range of functions including some that are best described as fuzzy logic. Fuzzy logic means we will guess the result if there is nothing obvious.

The ideal choice of search would find exactly what you want and, if there is not an exact match, give you the closest result. Fuzzy logic chooses any result that is a quick fit. Fuzzy logic may ignore an exact match because it does not enforce the type of rules required to find an exact match. Fuzzy logic may return a guess without telling you it is only a guess. You cannot trust the results of a fuzzy logic search or classification system. You have to have an additional validation or feedback process to check the results.
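The ideal search described above can be sketched in a few lines of Python. This is only an illustration using the standard library `difflib` module, not anything taken from Mahout. The function name `search`, the sample titles, and the 0.6 cutoff are assumptions made up for the example. The point is that the exact match is tried first and any fallback is labelled as a guess.

```python
from difflib import get_close_matches


def search(query, titles):
    """Return (result, is_exact). An exact match wins; anything else is a labelled guess."""
    # Step 1: enforce the exact-match rule before doing anything fuzzy.
    if query in titles:
        return query, True
    # Step 2: fall back to the closest title and report that it is only a guess.
    close = get_close_matches(query, titles, n=1, cutoff=0.6)
    if close:
        return close[0], False
    # Nothing is close enough, so say so instead of inventing a result.
    return None, False


titles = ["Apache Mahout", "Apache Lucene", "Apache Hadoop"]  # made-up sample data
print(search("Apache Mahout", titles))
print(search("Apache Mahoot", titles))
```

The second value in the result is the part a fuzzy system often leaves out. Your validation or feedback process can use it to decide which results need checking.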

Apache Mahout library functions have to be carefully analysed to understand the accuracy and validity of their results when applied to your data.
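One way to start that analysis is to run the function over a small sample of your own data where you already know the right answer, then count how often it agrees. The sketch below assumes the library function can be wrapped as a plain Python callable. The names `accuracy` and `toy_classifier` and the sample records are hypothetical stand-ins, not Mahout code.

```python
def accuracy(predict, labelled_sample):
    """Fraction of hand-checked records where predict() agrees with the known answer."""
    if not labelled_sample:
        raise ValueError("need at least one hand-checked record")
    correct = sum(1 for record, expected in labelled_sample if predict(record) == expected)
    return correct / len(labelled_sample)


def toy_classifier(text):
    """Hypothetical stand-in for a black-box classification function."""
    return "hardware" if "disk" in text else "other"


# Hand-checked sample: (record, the category a person assigned to it).
sample = [
    ("7200 rpm disk drive", "hardware"),
    ("notebook case", "hardware"),
    ("gift voucher", "other"),
    ("solid state disk", "hardware"),
]
print(accuracy(toy_classifier, sample))
```

The stand-in classifier gets the notebook case wrong, so it agrees with three of the four hand-checked records. A sample this small proves little. The same check on a few hundred of your own records tells you far more than the library documentation can.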

Java

Apache Mahout is written in Java because Apache Mahout is designed to work with other software written in Java. When you have enough experience of using Java, you realise you have to extensively test Java based applications with different data and in different environments to make sure they will not crash. In my experience, the cost of supporting Java is 50% or more higher than the alternatives. Clearly you would have to get a lot of benefit from the Apache Mahout library functions to cover the support costs.

Mixing Java Web software with other Web software creates additional installation and maintenance problems. Some of the interface modules for Java choose to leave Java in a separate server and communicate using Web services. If you use PHP, for example, Mahout can be accessed direct because PHP can call Java direct. PHP is equally good at calling Web services. You can put Mahout on a separate Web server focused on Java and maintained by a Java expert.

Frameworks

You can get packages that wrap the Apache Mahout library in a framework or Web service for use from applications written in mainstream programming languages. If you are not a Java programmer, separating the Java code from your Web site makes more sense than trying to maintain a mixture of code in two languages.

An interface framework gives you another advantage. The data you use in your code can stay in the same format all the way through your code. The framework can perform data conversions between your code and Mahout or Web services.

Drupal

Drupal is the world's most popular content management system for new Web sites bigger than a blog. Drupal is an example of an existing application connecting to Apache Mahout. First, Drupal will not depend on Apache Mahout. Apache Mahout will only be added as part of an optional add-on module and you are free to choose alternatives. Shopping carts and similar applications will have the chance to recommend exact matches, based on an understanding of the product range, before resorting to Apache Mahout.

As an example, a Web site selling disk drives knows that rotation speed is important when selecting a disk drive and disks have a small number of rotation speeds with 7200 being the most common fast disk. A disk drive shop can offer exact selection of rotation speed before resorting to a recommendation that may be imprecise.

Second, Drupal will use Apache Mahout through the Recommender module, which is clearly designed to recommend a close match, not an exact match. A recommendation might answer the question "fastest cheapest disk drive" by listing cheap disk drives ordered from fastest to slowest. You can choose the recommendation or continue browsing. The important thing is the bit stating what the recommendation is based on.
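The disk drive shop can be sketched as two separate steps: an exact selection by rotation speed, then a recommendation that states its basis. This is a minimal Python illustration, not the Drupal Recommender module or Mahout. The `Disk` class, the function names, and the catalogue prices are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    rpm: int
    price: float


def exact_by_rpm(disks, rpm):
    """Exact selection: only disks with exactly this rotation speed."""
    return [d for d in disks if d.rpm == rpm]


def recommend_fast_cheap(disks, max_price):
    """Close-match recommendation: affordable disks, fastest first, cheapest breaking ties."""
    picks = sorted(
        (d for d in disks if d.price <= max_price),
        key=lambda d: (-d.rpm, d.price),
    )
    # State what the recommendation is based on so the customer can judge it.
    basis = f"price <= {max_price}, ordered fastest to slowest"
    return picks, basis


# Made-up catalogue for illustration only.
catalogue = [
    Disk("A", 5400, 45.0),
    Disk("B", 7200, 60.0),
    Disk("C", 7200, 95.0),
    Disk("D", 10000, 150.0),
]

print([d.name for d in exact_by_rpm(catalogue, 7200)])
picks, basis = recommend_fast_cheap(catalogue, 100)
print([d.name for d in picks], "based on:", basis)
```

Returning the basis alongside the picks is the design choice that matters. The customer sees that disk D was left out because of price, not because it is slow.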

Classification

Apache Mahout is used for data mining. A lot of functions reduce data for classification then analysis and reporting. A social Web site might classify people by country and gender to feed data into a marketing campaign and to decide what content should be used. What they might not realise is how many logins do not represent one identifiable person. Families share one login. Some people from the Middle East log in using the husband's id because there is a local belief that women should not communicate freely with the open world. Some men in the western world do not use their own id when logging in because they work in the military and would be fired for expressing an independent opinion.

Most data is categorised then presented as if the classification and categorisation is accurate. When results are presented that are obviously wrong, the error is a deliberate attempt to make the results fit a marketing campaign instead of the other way round. Adobe want to sell products to create Flash files. Adobe tell you that 99% of Internet users use Flash. The Adobe figures are not based on Internet users. Instead the Adobe figures are based on a survey of Flash users. Apache Mahout results are only as good as the data plus the user's understanding of the data.
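The effect of choosing the wrong group to survey is easy to show with arithmetic. The sketch below uses an invented population of 100 users, 60 of whom have the feature being measured. The numbers are made up purely to show the mechanism and are not anyone's real figures. The function name `share_using` is also hypothetical.

```python
def share_using(population, in_survey=lambda person: True):
    """Share of surveyed people who use the feature; in_survey decides who gets asked."""
    surveyed = [person for person in population if in_survey(person)]
    return sum(person["uses_feature"] for person in surveyed) / len(surveyed)


# Invented population: 60 of 100 people use the feature.
population = [{"uses_feature": i < 60} for i in range(100)]

# Ask everyone.
print(share_using(population))
# Ask only people who already use the feature.
print(share_using(population, lambda person: person["uses_feature"]))
```

Asking everyone gives 60%. Asking only existing users gives 100%, whatever the true share is. The analysis code is identical in both cases, which is why the result depends on understanding where the data came from.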

An example

There are lots of things you can do with Mahout. I mentioned a sales example. The most common use of recommendation software is to direct a customer to a product. A customer logs into your shop and browses notebook cases. Your software can find a previous sale of a notebook to the same customer and feature notebook cases designed for the size of notebook your customer previously purchased.

You do not need Mahout for something this simple. A typical sales system looks through the notebook brands to find the one with the biggest profit margin then looks through the range from that brand for the ones recommended by the manufacturer to work with that model notebook.
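The simple rule above fits in a few lines without any analysis library. This Python sketch assumes you hold a margin figure per brand and a list of notebook models each case is recommended for. The function name, brand names, model codes, and margins are all hypothetical.

```python
def recommend_cases(cases, brand_margin, notebook_model):
    """Pick the brand with the biggest margin, then its cases listed as fitting the model."""
    best_brand = max(brand_margin, key=brand_margin.get)
    return [
        case["name"]
        for case in cases
        if case["brand"] == best_brand and notebook_model in case["fits"]
    ]


# Made-up shop data for illustration only.
brand_margin = {"Acme": 0.30, "Bolt": 0.45}
cases = [
    {"name": "Acme Sleeve 15", "brand": "Acme", "fits": {"NB-15"}},
    {"name": "Bolt Shell 15", "brand": "Bolt", "fits": {"NB-15", "NB-15X"}},
    {"name": "Bolt Shell 13", "brand": "Bolt", "fits": {"NB-13"}},
]

print(recommend_cases(cases, brand_margin, "NB-15"))
```

Every step in this rule can be explained to the sales manager. That is the standard a more complicated analysis has to meet before you trust it.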

When would you use Mahout or an equivalent? When the decision becomes complicated. Your customer might have purchased several different notebook computers or none. You have to use other selection criteria. You start using age, gender, country, city, previous purchases by brand, price ranges, anything that might influence their decision.

Now make the analysis more complicated. You want to advertise a special offer. What product would best benefit your sales from a price reduction? Now you have to find products that can sell in volume to your existing customers and are held back only because of a price slightly higher than they will pay.

You might analyse every sales record from day one when you set up your business. You might analyse from the day you expanded to multiple brands. Adding a large analysis library to analyse your data gives you more choices and makes some things easier.

There is still the problem of deciding which tool you will use. You can insert a screw using a hammer but there might be a better tool in the toolbox.

Search engines

A big problem with large volumes of data is finding the right data. Google is one of the best engineered search engines but it often produces results that are close to useless. You see all sorts of problems. Google will put out of date information ahead of current information because the old Web pages have accumulated more links. Google puts quotes in blogs ahead of the original source because blogs have better keyword density.

There are some really easy ways to fix Google for many common types of search. Google does not offer the option to input critical factors. If they used Mahout, Mahout would let them put the critical factors in but Mahout will not tell them what the critical factors are. Neither does the documentation at the Mahout Web site.

You have to go back to your data and understand what the data means. You can then propose tests to prove the meaning. Mahout might provide the best code for the analysis in your test.

Conclusion

Forget Apache Mahout by itself. Look instead for pre-built connections into your applications that use Apache Mahout. They will give you the best benefit in the shortest time. When you do find a quick way into Apache Mahout, trace backwards to find the exact functions used and the exact reduction performed on your data to find if it is an exact result or just a guess.

Comments

Peter,

I don't think that you understand what Mahout is intended for and you clearly didn't spend enough time to notice the parts that are fully production ready.

First of all, if you want a recommendation system, Mahout provides an excellent one in the form of a Meccano kit ready for assembly. By nature, recommendation systems require quite a bit of tuning to figure out what data you have and how the recommender can be integrated into your overall system. Large companies like AOL use Mahout for this purpose with good results.

Secondly, Mahout provides quite a lot of advanced mathematical capability for data mining in the form of sharp tools intended for craftsmen who know what they are about. The fact that much of it is made available as it is developed is a boon, not a bane. This is not the same as web content management systems where the topic has been beaten to death and good approaches are both very simple and very well understood. Mahout addresses frontier needs that you don't seem to have or understand. That doesn't mean that Mahout isn't useful or that you don't know your own needs. It just means that they differ.

Your comments that the designers of the classification algorithms in Mahout don't understand shared accounts are a good indication of this. There is quite a lot of machinery in Mahout to deal with noisy inputs caused by things like that and the people working on Mahout are *very* aware of issues like this. We have built some of the largest web sites and some of the largest web-based data mining applications around.

So lighten up a bit on stuff outside your area of expertise.

Hello Ted, Your comments suggest you know Apache Mahout at the code level or work on a project where there are people who work at the code level. I look at Mahout from the outside as someone considering plugging Mahout in to provide a function. I do not have Java programmers on board to inspect the code and have to follow the documentation, which warns that some functions are new and untested, but does not tell me how to identify the different levels of testing for each module.

Your Meccano set analogy is useful. If you understand how a crane works, you can build a model of an industrial crane using a Meccano set. If you do not understand how a crane works, you can follow the instructions in a book but cannot diagnose problems or recognise ineffective construction. For that reason I recommend not diving into Mahout. Instead work from the other end. What do you want to achieve? What facilities are provided in your application or programming language? Can you find examples and tutorials for those facilities? If those facilities use Mahout, fine, use whatever they connect to. The important thing is to work through examples of the results at the point where you select options or create parameters or write code.

My experience with data mining is based on hand coding summation, analysis, sales analysis, and management reports for banks, oil companies, governments, just about every category of TLA organisation. The input might be a few billion records from many sources. The results might be several Excel charts.

One of the big problems is matching data that does not have matching identifiers. Another is matching data that is already summarised. You end up with a big mix of steps performing what is essentially magic to those who do not understand the process. The people using the results cannot always prove the results are valid or accurate.

Mapping out the process is the most important part. You then have to prove your software follows the map. I could not do that with Mahout based on the documentation I found for Mahout.

If someone asked me to analyse the efficiency of a distribution system, I would start with software that understands the data used in distribution systems. What I care about is the type of analysis and the provable degree of accuracy. I do not care if that software uses Mahout in the background and I will not hire a Java programmer to read the Java code. I will look for documentation at the distribution software level or treat the application as a black box and perform independent verification tests.

Look at the selection of tools from another point of view. All the tools used in arson are very well documented but, according to police, the easiest way to find an arsonist is to visit the burns unit in local hospitals. The documentation on the tools used is not oriented toward arson.

The documentation for Mahout might be extensive but it is not oriented to someone who wants to make their application better. It is not oriented to people not using Java. Where would you tell someone to start reading if they want to improve their analysis if none of the software in their system is written in Java?