Apache Mahout
Submitted by Peter on Fri, 2011-05-06 23:39
Apache Mahout is a library of code you might want to use in an analysis or search application. The complexity is enough to make you wait for a finished application using the library instead of writing your own. Unfortunately finished applications are rarely finished when first released and are oversold on capability. You need to understand the benefit and the limitations of the underlying technology before committing to use the results from any application.
Beta software
Apache Mahout is so early in the software development cycle that it is barely beta software and you use it at your own risk. There are some functions with significant testing and some that are brand new. Every release will make some functions mature while others will be fresh out of the oven. In fact some will be fresh out of the mixing bowl before going in the oven. You have to look at the testing of every function before use.
If you are working on a Mahout based project, you will know the history of each item in the Mahout library. I tried to work out the status and testing of some individual functions and it was all too hard. The best I could find was a division between core and test.
I started looking at Mahout for use in a CMS, Content Management System, because the CMS has an interface module. The interface module does not document what Mahout does. Mahout provides documentation then warns that some functions are new with little testing so I tried to find out more. I did not find anything obvious and, for my purpose, would have to treat everything as untested.
Fuzzy logic
Fuzzy logic was a fashionable term a few years ago and now people hate fuzzy logic because it did not do what software salespeople promised and sometimes produced errors. Apache Mahout offers a range of functions including some that are best described as fuzzy logic. Fuzzy logic means we will guess the result if there is nothing obvious.
The ideal choice of search would find exactly what you want and, if there is not an exact match, give you the closest result. Fuzzy logic chooses any result that is a quick fit. Fuzzy logic may ignore an exact match because it does not enforce the type of rules required to find an exact match. Fuzzy logic may return a guess without telling you it is only a guess. You cannot trust the results of a fuzzy logic search or classification system. You have to have an additional validation or feedback process to check the results.
Apache Mahout library functions have to be carefully analysed to understand the accuracy and validity of their results when applied to your data.
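The difference between an exact match and a guess can be made visible in a few lines of code. The following is a minimal sketch in Python, not anything taken from Mahout: the function name find_product and the catalogue entries are made up for illustration, and the fuzzy step uses difflib from the Python standard library. The point is that the exact match is tried first and every fuzzy result is flagged as a guess.

```python
from difflib import get_close_matches


def find_product(query, names):
    """Return (match, is_exact).

    An exact (case-insensitive) match always wins. If there is none,
    the closest name is returned with is_exact=False so the caller
    knows the result is only a guess.
    """
    lookup = {name.lower(): name for name in names}
    key = query.strip().lower()
    if key in lookup:
        return lookup[key], True
    # Fuzzy fallback: the cutoff of 0.6 is an arbitrary illustrative value.
    guesses = get_close_matches(key, list(lookup), n=1, cutoff=0.6)
    if guesses:
        return lookup[guesses[0]], False
    return None, False


# Hypothetical catalogue used only for this example.
catalogue = [
    "Samsung HD204UI",
    "Hitachi HDS722020ALA330",
    "Western Digital WD20EARX",
]

print(find_product("samsung hd204ui", catalogue))  # exact match
print(find_product("Samsung HD204", catalogue))    # flagged as a guess
```

A validation or feedback process can then treat the two cases differently, for example by showing "Did you mean ...?" for a guess instead of presenting it as the answer.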
Java
Apache Mahout is written in Java because Apache Mahout is designed to work with other software written in Java. When you have enough experience of using Java, you realise you have to extensively test Java based applications with different data and in different environments to make sure they will not crash. In my experience the cost of supporting Java is 50% or more higher than the alternatives. Clearly you would have to get a lot of benefit from the Apache Mahout library functions to cover the support costs.
Mixing Java Web software with other Web software creates additional installation and maintenance problems. Some of the interface modules for Java choose to leave Java in a separate server and communicate using Web services. If you use PHP, for example, Mahout can be accessed direct because PHP can call Java direct. PHP is equally good at calling Web services. You can put Mahout on a separate Web server focused on Java and maintained by a Java expert.
Frameworks
You can get packages that wrap the Apache Mahout library in a framework or Web service for use from applications written in mainstream programming languages. If you are not a Java programmer, separating the Java code from your Web site makes more sense than trying to maintain a mixture of code in two languages.
An interface framework gives you another advantage. The data you use in your code can stay in the same format all the way through your code. The framework can perform data conversions between your code and Mahout or Web services.
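As an illustration of the conversion step, here is a minimal Python sketch that turns purchase records into plain text lines of userID,itemID,preference. As far as I know, that is the layout read by the file based data model in Mahout's recommender code, with numeric IDs, but treat that as an assumption and check the documentation for your version. The field names customer, sku, and quantity, the helper names, and the use of quantity as the preference value are all hypothetical.

```python
import csv
import io


def assign_ids(values):
    """Map each distinct value to an integer ID in first-seen order."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids) + 1)
    return ids


def to_preference_csv(purchases, user_ids, item_ids):
    """Convert purchase dicts to 'userID,itemID,preference' lines.

    Assumption: quantity purchased stands in for the preference value.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for p in purchases:
        writer.writerow(
            [user_ids[p["customer"]], item_ids[p["sku"]], float(p["quantity"])]
        )
    return out.getvalue()


# Hypothetical records in the application's own format.
purchases = [
    {"customer": "alice@example.com", "sku": "NB-15-CASE", "quantity": 1},
    {"customer": "bob@example.com", "sku": "NB-15-CASE", "quantity": 2},
    {"customer": "alice@example.com", "sku": "SSD-128", "quantity": 1},
]
user_ids = assign_ids(p["customer"] for p in purchases)
item_ids = assign_ids(p["sku"] for p in purchases)
print(to_preference_csv(purchases, user_ids, item_ids))
```

The ID maps have to be kept so that numeric results coming back from the analysis can be translated into the customer and product identifiers your own code uses.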
Drupal
Drupal is the world's most popular content management system for new Web sites bigger than a blog. Drupal is an example of an existing application connecting to Apache Mahout. First, Drupal will not depend on Apache Mahout. Apache Mahout will only be added as part of an optional add-on module and you are free to choose alternatives. Shopping carts and similar applications will have the chance to recommend exact matches, based on an understanding of the product range, before resorting to Apache Mahout.
As an example, a Web site selling disk drives knows that rotation speed is important when selecting a disk drive and disks have a small number of rotation speeds with 7200 being the most common fast disk. A disk drive shop can offer exact selection of rotation speed before resorting to a recommendation that may be imprecise.
Second, Drupal will use Apache Mahout through the Recommender module which is clearly designed to recommend a close match, not an exact match. A recommendation might answer the question "fastest cheapest disk drive" by listing cheap disk drives ordered from fastest to slowest. You can choose the recommendation or continue browsing. The important thing is the bit stating what the recommendation is based on.
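The two steps in the disk drive example can be sketched as follows. This is illustrative Python, not code from Drupal or the Recommender module. The Disk class, the model names, and the prices are invented. The first function is the exact selection by rotation speed; the second is one possible reading of "fastest cheapest" and returns a plain statement of what the recommendation is based on.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    model: str
    rpm: int
    price: float


def exact_by_rpm(disks, rpm):
    """Exact selection: only disks with exactly this rotation speed."""
    return [d for d in disks if d.rpm == rpm]


def recommend_fast_and_cheap(disks, max_price):
    """Imprecise recommendation: cheap disks ordered fastest to slowest.

    Returns the list plus a sentence stating the basis of the recommendation.
    """
    cheap = [d for d in disks if d.price <= max_price]
    picks = sorted(cheap, key=lambda d: (-d.rpm, d.price))
    basis = f"disks priced at or below {max_price:.2f}, ordered fastest to slowest"
    return picks, basis


# Hypothetical stock list.
disks = [
    Disk("A", 5400, 80.0),
    Disk("B", 7200, 95.0),
    Disk("C", 7200, 140.0),
    Disk("D", 5900, 70.0),
]

print([d.model for d in exact_by_rpm(disks, 7200)])
picks, basis = recommend_fast_and_cheap(disks, 100)
print([d.model for d in picks], "based on:", basis)
```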
Classification
Apache Mahout is used for data mining. A lot of functions reduce data for classification then analysis and reporting. A social Web site might classify people by country and gender to feed data into a marketing campaign and to decide what content should be used. What they might not realise is the number of families that share one login, the number of people from the Middle East who log in using the husband's id because there is a local belief that women should not communicate freely with the open world, or the number of men in the western world who do not use their own id when logging in because they work in the military and would be fired for expressing an independent opinion.
Most data is categorised then presented as if the classification and categorisation is accurate. When results are presented that are obviously wrong, the error is a deliberate attempt to make the results fit a marketing campaign instead of the other way round. Adobe want to sell products to create Flash files. Adobe tell you that 99% of Internet users use Flash. The Adobe figures are not based on Internet users. Instead the Adobe figures are based on a survey of Flash users. Apache Mahout results are only as good as the data plus the user's understanding of the data.
An example
There are lots of things you can do with Mahout. I mentioned a sales example. The most common use of recommendation software is to direct a customer to a product. A customer logs into your shop and browses notebook cases. Your software can find a previous sale of a notebook to the same customer and feature notebook cases designed for the size of notebook your customer previously purchased.
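A rule of this kind needs nothing more than a lookup. The sketch below is hypothetical Python with invented field names (customer, type, screen_inches, fits_inches) and invented data. It finds the customer's most recent notebook purchase and moves the cases for that screen size to the front of the list.

```python
def cases_for_customer(customer_id, sales, cases):
    """Feature the cases that fit the customer's most recent notebook."""
    notebooks = [
        s for s in sales
        if s["customer"] == customer_id and s["type"] == "notebook"
    ]
    if not notebooks:
        return list(cases)  # no purchase history: show the normal range
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    latest = max(notebooks, key=lambda s: s["date"])
    size = latest["screen_inches"]
    featured = [c for c in cases if c["fits_inches"] == size]
    others = [c for c in cases if c["fits_inches"] != size]
    return featured + others


# Hypothetical sales history and case range.
sales = [
    {"customer": 7, "type": "notebook", "screen_inches": 13, "date": "2010-03-01"},
    {"customer": 7, "type": "notebook", "screen_inches": 15, "date": "2011-02-14"},
    {"customer": 9, "type": "case", "screen_inches": None, "date": "2011-01-05"},
]
cases = [
    {"name": "Slim 13", "fits_inches": 13},
    {"name": "Tough 15", "fits_inches": 15},
    {"name": "Sleeve 15", "fits_inches": 15},
]

print([c["name"] for c in cases_for_customer(7, sales, cases)])
```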
You do not need Mahout for something this simple. A typical sales system looks through the notebook brands to find the one with the biggest profit margin then looks through the range from that brand for the ones recommended by the manufacturer to work with that model notebook.
When would you use Mahout or an equivalent? When the decision becomes complicated. Your customer might have purchased several different notebook computers or none. You have to use other selection criteria. You start using age, gender, country, city, previous purchases by brand, price ranges, anything that might influence their decision.
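To give a feel for what happens when several criteria are combined, here is a toy Python sketch of a similarity based recommendation. It is not Mahout's algorithm and the data is invented. Each customer is described by a set of features (country, age band, brand purchased), similarity is measured as the overlap between feature sets, and items bought by similar customers are suggested with a score. A library such as Mahout exists to do this sort of calculation, with far better mathematics, over millions of records.

```python
def jaccard(a, b):
    """Overlap between two sets: 0.0 is nothing shared, 1.0 is identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, customers, top_n=2):
    """Suggest items bought by the customers most similar to the target.

    Returns (item, score) pairs so the caller can see how strong,
    or how weak, each suggestion is.
    """
    me = customers[target]
    scores = {}
    for name, other in customers.items():
        if name == target:
            continue
        similarity = jaccard(me["features"], other["features"])
        if similarity == 0:
            continue  # nothing in common: do not guess
        for item in other["bought"] - me["bought"]:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical customers.
customers = {
    "ann": {"features": {"AU", "30-39", "brand:Acer"},
            "bought": {"Acer netbook"}},
    "bill": {"features": {"AU", "30-39", "brand:Acer"},
             "bought": {"Acer netbook", "10 inch sleeve"}},
    "cara": {"features": {"UK", "50-59", "brand:Apple"},
             "bought": {"iPad", "iPad cover"}},
}

print(recommend("ann", customers))
```

Returning the score along with the item is deliberate. A suggestion with a low score is a guess and should be presented as one.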
Now make the analysis more complicated. You want to advertise a special offer. What product would best benefit your sales from a price reduction? Now you have to find products that can sell in volume to your existing customers and are held back only because of a price slightly higher than they will pay.
You might analyse every sales record from day one when you set up your business. You might analyse from the day you expanded to multiple brands. Adding a large analysis library to analyse your data gives you more choices and makes some things easier.
There is still the problem of deciding which tool you will use. You can insert a screw using a hammer but there might be a better tool in the toolbox.
Search engines
A big problem with large volumes of data is finding the right data. Google is one of the best engineered search engines but it often produces results that are close to useless. You see all sorts of problems. Google will put out of date information ahead of current information because the old Web pages have accumulated more links. Google puts quotes in blogs ahead of the original source because blogs have better keyword density.
There are some really easy ways to fix Google for many common types of search. Google does not offer the option to input critical factors. If they used Mahout, Mahout would let them put the critical factors in but Mahout will not tell them what the critical factors are. Neither does the documentation at the Mahout Web site.
You have to go back to your data and understand what the data means. You can then propose tests to prove the meaning. Mahout might provide the best code for the analysis in your test.
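One way to frame such a test is to hide a known purchase from the software and see whether the recommendation finds it. The Python sketch below treats the recommender as a black box: any function that takes a customer and a purchase history and returns a ranked list of items. The names hit_rate and most_popular and the data are hypothetical, and most_popular is only a trivial stand-in for whatever software you are evaluating.

```python
def hit_rate(recommender, history, held_out, top_n=3):
    """Fraction of customers whose hidden purchase is in their top_n suggestions."""
    if not held_out:
        return 0.0
    hits = 0
    for customer, actual in held_out.items():
        suggestions = recommender(customer, history)[:top_n]
        if actual in suggestions:
            hits += 1
    return hits / len(held_out)


def most_popular(customer, history):
    """Stand-in black box: best-selling items the customer does not already own."""
    counts = {}
    for items in history.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    owned = history.get(customer, set())
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in owned]


# Hypothetical purchase history, with one later purchase per customer hidden.
history = {
    "ann": {"notebook"},
    "bill": {"notebook", "case"},
    "cara": {"notebook", "case", "mouse"},
}
held_out = {"ann": "mouse", "bill": "mouse"}

print(hit_rate(most_popular, history, held_out, top_n=1))
```

The same harness can be pointed at any recommendation function, which lets you compare a Mahout based module against a simple baseline on your own data before trusting either.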
Conclusion
Forget Apache Mahout by itself. Look instead for pre-built connections into your applications that use Apache Mahout. They will give you the best benefit in the shortest time. When you do find a quick way into Apache Mahout, trace backwards to find the exact functions used and the exact reduction performed on your data to find if it is an exact result or just a guess.
Comments
Nice try ...
Peter,
I don't think that you understand what Mahout is intended for and you clearly didn't spend enough time to notice the parts that are fully production ready.
First of all, if you want a recommendation system, Mahout provides an excellent one in the form of a Meccano kit ready for assembly. By nature, recommendation systems require quite a bit of tuning to figure out what data you have and how the recommender can be integrated into your overall system. Large companies like AOL use Mahout for this purpose with good results.
Secondly, Mahout provides quite a lot of advanced mathematical capability for data mining in the form of sharp tools intended for craftsmen who know what they are about. The fact that much of it is made available as it is developed is a boon, not a bane. This is not the same as web content management systems where the topic has been beaten to death and good approaches are both very simple and very well understood. Mahout addresses frontier needs that you don't seem to have or understand. That doesn't mean that Mahout isn't useful or that you don't know your own needs. It just means that they differ.
Your comments that the designers of the classification algorithms in Mahout don't understand shared accounts are a good indication of this. There is quite a lot of machinery in Mahout to deal with noisy inputs caused by things like that and the people working on Mahout are *very* aware of issues like this. We have built some of the largest web sites and some of the largest web-based data mining applications around.
So lighten up a bit on stuff outside your area of expertise.
I added some notes to the page in response to your feedback
Hello Ted, Your comments suggest you know Apache Mahout at the code level or work on a project where there are people who work at the code level. I look at Mahout from the outside as someone considering plugging Mahout in to provide a function. I do not have Java programmers on board to inspect the code and have to follow the documentation, which warns that some functions are new and untested, but does not tell me how to identify the different levels of testing for each module.
Your Meccano set analogy is useful. If you understand how a crane works, you can build a model of an industrial crane using a Meccano set. If you do not understand how a crane works, you can follow the instructions in a book but cannot diagnose problems or recognise ineffective construction. For that reason I recommend not diving into Mahout. Instead work from the other end. What do you want to achieve? What facilities are provided in your application or programming language? Can you find examples and tutorials for those facilities? If those facilities use Mahout, fine, use whatever they connect to. The important thing is to work through examples of the results at the point where you select options or create parameters or write code.
My experience with data mining is based on hand coding summation, analysis, sales analysis, and management reports for banks, oil companies, governments, just about every category of TLA organisation. The input might be a few billion records from many sources. The results might be several Excel charts.
One of the big problems is matching data that does not have matching identifiers. Another is matching data that is already summarised. You end up with a big mix of steps performing what is essentially magic to those who do not understand the process. The people using the results cannot always prove the results are valid or accurate.
Mapping out the process is the most important part. You then have to prove your software follows the map. I could not do that with Mahout based on the documentation I found for Mahout.
If someone asked me to analyse the efficiency of a distribution system, I would start with software that understands the data used in distribution systems. What I care about is the type of analysis and the provable degree of accuracy. I do not care if that software uses Mahout in the background and I will not hire a Java programmer to read the Java code. I will look for documentation at the distribution software level or treat the application as a black box and perform independent verification tests.
Look at the selection of tools from another point of view. All the tools used in arson are very well documented but, according to police, the easiest way to find an arsonist is to visit the burns unit in local hospitals. The documentation on the tools used is not oriented toward arson.
The documentation for Mahout might be extensive but it is not oriented to someone who wants to make their application better. It is not oriented to people not using Java. Where would you tell someone to start reading if they want to improve their analysis if none of the software in their system is written in Java?