Apache Cassandra
Submitted by Peter on Thu, 2011-02-10 00:28
Apache Cassandra is software that presents a database table without the database. There is a fashion for replacing database software with software that does part of what database software does and claiming a speed improvement from the result. Apache Cassandra is one of the options and presents a single database table with some of the overheads of a database replaced by your code.
There are several alternatives to databases that are effectively indexed tables sitting in isolation. You have to write your own code to do what databases do. On rare occasions this manual approach is faster. On most occasions tuning your database will give you a better speed increase faster than rewriting your code. On some occasions no amount of rewriting your code and tables manually will match the performance of a database with hundreds of person years invested in development.
Apache Cassandra has a lot of people's time tied up in development and usage, but explained successes are rare. There are cases where people claim better performance from using Apache Cassandra but cannot explain what they changed or why it works. A real problem is the lack of a comparison with the same time spent tuning their original database or, in the case of MySQL, using a different database engine.
Two billion rows
I read a brag page about how Apache Cassandra handled a large amount of data on one table compared to the previous software used by the bragger. The Apache Cassandra specifications, or one of their case studies, said the table can handle two billion rows. A visitor made the comment that the two billion limit refers to columns, not rows. The introductory articles at the Cassandra Web site mention two billion columns and unlimited rows.
Two billion is roughly the maximum positive value of a 32 bit signed integer. MySQL can use unsigned 32 bit integers as an index, and a 32 bit unsigned integer can store values up to about four billion. Over the many years I have worked on databases, there were several occasions when one of my applications hit four billion rows and I restructured a table or the table processing to reduce the number of rows. Two billion is not a big number. Twenty billion is not a big number.
One hundred billion is a big number for rows in a database table based on comments by people about their biggest tables.
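The integer limits behind the "two billion" and "four billion" figures can be checked with a little arithmetic. This is a minimal sketch. The helper name fits_in_unsigned_32 is made up for illustration and is not part of MySQL or Cassandra.

```python
# Largest values that fit in 32 bit signed and unsigned integers.
SIGNED_32_MAX = 2**31 - 1      # 2,147,483,647: the "two billion" limit
UNSIGNED_32_MAX = 2**32 - 1    # 4,294,967,295: the "four billion" limit


def fits_in_unsigned_32(row_count: int) -> bool:
    """Return True when row_count can be held in an unsigned 32 bit key."""
    return 0 <= row_count <= UNSIGNED_32_MAX


print(f"{SIGNED_32_MAX:,}")
print(f"{UNSIGNED_32_MAX:,}")
print(fits_in_unsigned_32(20_000_000_000))  # twenty billion needs a wider key
```

A table that grows past these limits needs a 64 bit key or a restructure, which is the choice described above.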
On some occasions people use PostgreSQL table partitioning to spread large tables over several servers. Looking through the Cassandra white papers, Cassandra users are doing the equivalent of partitioning PostgreSQL tables. There is a lot of flexibility in existing databases for handling large numbers of rows without switching to Cassandra.
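The idea of spreading rows over several servers can be sketched in a few lines. This is an illustration of the general technique only, not how PostgreSQL or Cassandra implement partitioning internally. The server names and the functions server_for and partition_rows are hypothetical.

```python
import zlib

# Hypothetical server names, used only for illustration.
SERVERS = ["db-a", "db-b", "db-c"]


def server_for(row_key: str, servers=SERVERS) -> str:
    """Pick a server for a row by hashing its key with a stable hash."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % len(servers)
    return servers[bucket]


def partition_rows(row_keys, servers=SERVERS):
    """Group row keys by the server each one would be stored on."""
    placement = {name: [] for name in servers}
    for key in row_keys:
        placement[server_for(key, servers)].append(key)
    return placement


placement = partition_rows([f"user{i}" for i in range(100)])
for name, keys in placement.items():
    print(name, len(keys))
```

The same key always maps to the same server, so a lookup only has to visit one partition.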
Latest update
I am writing about Apache Cassandra 0.7 from cassandra.apache.org. Prior to 0.7, Apache Cassandra, like many other no-database tables, bragged about super performance and the lack of database indexes to slow things down. Apache Cassandra 0.7 brags about finally having full indexes. I think that makes Apache Cassandra 0.7 the same as a normal database.
The slowest database tables are slow because they have flexible indexes, row level locking, and data schemas. The schemas let you change the layout of a table row without taking your Web site offline. Row level locking lets you update part of a database table without stopping access to other parts of the table. Indexes give you faster selections at the expense of slower updates. Exactly the same things happen when you make the same facilities available for the table based software including Apache Cassandra.
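The index trade-off can be shown with a toy in-memory table. This is a minimal sketch with a hypothetical IndexedTable class and a made-up "city" column. It is not the way any real database or Cassandra stores indexes.

```python
from collections import defaultdict


class IndexedTable:
    """Toy table with one secondary index on the 'city' column."""

    def __init__(self):
        self.rows = []
        self.by_city = defaultdict(list)  # index: city -> row positions

    def insert(self, row: dict) -> None:
        self.rows.append(row)
        # Extra work on every write: the price paid for the index.
        self.by_city[row["city"]].append(len(self.rows) - 1)

    def find_scan(self, city: str) -> list:
        # Without an index every row has to be examined.
        return [row for row in self.rows if row["city"] == city]

    def find_indexed(self, city: str) -> list:
        # With the index only the matching rows are touched.
        return [self.rows[i] for i in self.by_city.get(city, [])]


table = IndexedTable()
table.insert({"name": "Ann", "city": "Sydney"})
table.insert({"name": "Bob", "city": "Perth"})
table.insert({"name": "Cat", "city": "Sydney"})
print(table.find_indexed("Sydney"))
```

Both lookups return the same rows. The indexed lookup is cheaper to read and the insert is more expensive, whichever software maintains the index.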
Proven?
Cassandra is in use at Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX, and more companies that have large, active data sets. The largest production cluster has over 100 TB of data in over 150 machines.
Sounds impressive. All those big sites use Cassandra, but how do they actually use it? Look through the articles at wiki.apache.org/cassandra/ArticlesAndPresentations.
Consistency
A common thread when talking about Cassandra is consistency. Several case studies and blogs contain lines similar to the following from Digg:
"Since it was already necessary to abandon data normalization and consistency to make these approaches work, we felt comfortable looking at more exotic, non-relational data stores."
The customer was happy to give up consistency in return for performance. Hey, you can get better performance out of relational databases by giving up consistency. Why switch to something else?
Why is consistency important? When can you throw it out? Consistency is vital in a shopping site when recording stock quantities and money paid. A mistake costs you money and drives away a customer. Consistency is not important when displaying user ratings of products. When there are a lot of users rating a product, it does not matter if the current rating is 4.5 and you display the rating of 4 from 10 minutes ago. Ratings can be calculated by an irregular process running in the background, and the updated ratings can be distributed slowly across servers. You can achieve exactly that type of split processing while staying with MySQL or PostgreSQL. You would not switch to Cassandra just for that type of change. You would choose Cassandra only when you want to combine inconsistency with some other features of Cassandra.
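The ratings example can be sketched without any particular database. This is a minimal illustration, assuming a hypothetical ProductRating class where votes are recorded immediately and the displayed rating changes only when a background job calls refresh().

```python
class ProductRating:
    """Record votes at once, but show a rating that is only refreshed now and then."""

    def __init__(self):
        self.votes = []        # consistent record of every vote
        self.displayed = None  # possibly stale value shown to visitors

    def add_vote(self, stars: int) -> None:
        self.votes.append(stars)  # cheap write, no recalculation

    def refresh(self) -> None:
        # Meant to be called by a background job, not on every page view.
        if self.votes:
            self.displayed = sum(self.votes) / len(self.votes)

    def display(self):
        return self.displayed


rating = ProductRating()
rating.add_vote(4)
rating.refresh()
rating.add_vote(5)
print(rating.display())  # still the older rating until the next refresh
rating.refresh()
print(rating.display())
```

The votes stay accurate while the displayed figure lags behind, which is the only place the inconsistency is allowed.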
Data normalisation
Cassandra is often used to replace normalised data with unnormalised data to improve retrieval speed. Look again at the line from Digg. They abandoned data normalisation:
"Since it was already necessary to abandon data normalization and consistency to make these approaches work, we felt comfortable looking at more exotic, non-relational data stores."
Data normalisation gives you accurate data. Denormalisation can produce inaccurate data because the copies fall out of step with the original. A common approach with any database is to create the original data normalised then create denormalised copies of the data for faster retrieval. You do not need Cassandra just for this feature. The example mentioned in the Digg blog connects user 1 to users 2, 3, and 4 through table A then connects users 2, 3, and 4 to other users through table B. Cassandra is effectively used to preselect the list and store the list for future use. You can do that with any database. MySQL MyISAM tables are a good choice because they have low read overheads. Cassandra may offer an advantage when the prebuilt lists are very large.
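The prebuilt list approach can be sketched with an ordinary relational database. The example below uses SQLite from the Python standard library purely for convenience. The table names table_a, table_b, and prebuilt_list, the sample rows, and the functions rebuild_list and read_list are assumptions for illustration and are not taken from the Digg blog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (user_id INTEGER, linked_id INTEGER);
    CREATE TABLE table_b (user_id INTEGER, linked_id INTEGER);
    -- Denormalised copy: one row per user holding the preselected list.
    CREATE TABLE prebuilt_list (user_id INTEGER PRIMARY KEY, linked TEXT);
""")
conn.executemany("INSERT INTO table_a VALUES (?, ?)", [(1, 2), (1, 3), (1, 4)])
conn.executemany("INSERT INTO table_b VALUES (?, ?)", [(2, 5), (3, 6), (4, 7)])


def rebuild_list(user_id: int) -> str:
    """Run the two-step join once and store the result for later reads."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.linked_id
        FROM table_a AS a
        JOIN table_b AS b ON b.user_id = a.linked_id
        WHERE a.user_id = ?
        ORDER BY b.linked_id
        """,
        (user_id,),
    ).fetchall()
    linked = ",".join(str(row[0]) for row in rows)
    conn.execute(
        "INSERT OR REPLACE INTO prebuilt_list VALUES (?, ?)", (user_id, linked)
    )
    return linked


def read_list(user_id: int) -> list:
    """Fast path: a single keyed read with no joins."""
    row = conn.execute(
        "SELECT linked FROM prebuilt_list WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0].split(",") if row and row[0] else []


rebuild_list(1)
print(read_list(1))
```

The normalised tables remain the accurate source. The prebuilt list is rebuilt whenever a stale copy stops being acceptable.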
Everything in memory
Cassandra users often mention they store everything in memory. Several database products perform better when they have lots of memory. How much memory a database uses often depends on its initial settings. Looking at the switch to Cassandra mentioned in some blogs and white papers, they did the equivalent of increasing the server memory by a huge amount, altering the database settings to use all the extra memory, then monitoring the result in fine detail for a long time to refine the settings. You can make MySQL and some other databases ten or more times faster with the same approach. This is something worth doing before rewriting your application to use new database software.
Alternatives
Digg switched to Cassandra after looking at several alternatives:
"After considering HBase, Hypertable, Cassandra, Tokyo Cabinet/Tyrant, Voldemort, and Dynomite, we settled on Cassandra."
Conclusion
Apache Cassandra offers the same trade-off as the alternatives. You get less function than a database and you write your own code to perform the work previously performed by a database. Database tuning will initially give you bigger gains than replacing database code with your code. In some rare instances your code might be more efficient than the database code and worth the massive investment you need to replace database tables with discrete tables of the Apache Cassandra type.
Comments
Not 2 billion rows, but 2 billion columns. Rows are unlimited.
You apparently do not understand what a column means in Cassandra. It is not 2 billion rows, but 2 billion columns. Number of rows are unlimited.