Well, it is the last day of the MySQL Users Conference. It’s been really interesting with lots of great sessions and keynotes. Today was a bit shorter and I think the programme was the weakest as far as sessions went but here are the ones I attended.
Rasmus gave a keynote today with the rather strange title “PHP on Hormones“. He showed us an example of “PHP v1.0” and gave four reasons why people contribute to open source:
- Self interest
- Self expression
- Hormones (oxytocin nature’s “trust hormone” is released through interaction with others)
- Improve the world
He made the good point that Web 2.0 is all about the contributions from the users and successful sites harness network effects and get better the more people use them in a way that caters to their own interest.
It’s not what people think about you, but rather what they think about themselves.
This is a fundamental law of human nature that is worth remembering. People like to feel important and, in general, are self-interested.
He went over a lot of the same stuff as he talked about in his other session on security (including defacing the conference website through its XSS vulnerability 🙂 ).
Another interesting keynote was Scaling MySQL at YouTube by Paul Tuckfield. This was packed with some fairly technical stuff and he didn’t quite have time to go through it all in enough detail but here are some key points:
The top reasons for YouTube scalability are:
- python
- memcache
- replication
They originally had a separate database running MyISAM to do click tracking and ran everything else on an InnoDB database with replication. YouTube have far more reads than writes so replication works well for them but it is a constant battle to keep replication lag down.
One way they worked around replication lag was to have separate replication pools – certain pages are sent to certain replicas. This lowered lag and increased cache affinity.
The other way to speed up replication is with hardware. Their standard DB boxes are:
- 4x2GHz cores
- 16GB RAM
- 12x10k RAID10
Paul is a big believer in the power of disks. He suggests rather than upgrading CPUs people should increase RAM and spindle count.
He also had some good suggestions for optimising drives and caching. Obviously if you can fit your DB in RAM that is great but otherwise writes should only be cached by the RAID controller and reads should only be cached by the database. This is because the database will have the largest cache so a hit in database cache means the lower caches went unused but a miss in the database cache can only miss in other caches because they’re smaller and not as smart.
Implementing High Availability with DRBD was an interesting discussion of the Distributed Replicated Block Device (DRBD). DRBD can give you more availability than standard replication but less than cluster. Five nines of uptime means your site has to be up for all but five minutes in a year.
DRBD is kind of like network RAID1. While standard replication is asynchronous, DRBD uses synchronous block replication. It is set up with heartbeat in a passive/active failover setup. You can still use replication with this by pointing the slaves at the floating IP.
John Allspaw gave a session on capacity planning at flickr. They use ganglia for monitoring which looks pretty cool and the following tools for deployments:
In the High Performance Data Warehousing with MySQL session, Brian Miezejewski made the point that partitioning is the key but the scheme should be flexible to grow without table rebuilds.
He suggested MyISAM should be used for most DW needs because the data files are smaller and they are easy to copy between servers and for most DW queries they are faster than InnoDB. However, there are problems with table locking for loading high volumes of real-time data, deletes will break concurrent inserts, there is no index clustering and table repairs on big tables take a LONG time.
InnoDB is good for high volume streaming inserts (although if you ONLY ever insert then you could use MyIsam). It is also good if ALL queries can take advantage of index clustering (innodb clusters the data with its primary key and lookups using the PK are faster than keyed lookups in myisam) but if you don’t use the primary keys performance can go down.
Primary keys in InnoDB should be kept small as they are stored with every secondary as well.
Brian also talked about the new Nitro storage engine. This proprietary engine seems to do some pretty amazing stuff and I was talking to Terry Christensen from Nitro Security about it the other day.
It can do very high insertion rates (20k inserts/sec) with simultaneous queries. Also, it has ultra-high performance on aggregate operations (sum, avg, etc) on index values. It uses some ingenious indexing for this. It runs well on 32bit machines and can use partial indexes in ways other engines cannot.
The only downsides seem to be that it doesn’t do transactions and it costs $15,000 per server (this includes a full MySQL Enterprise Edition). Still, the stats for this engine are very impressive.
For backing up very large tables, Brian suggests copying the data files or if you partition you only need to backup the partition that has changed. He also described a trick for building aggregates on the fly using insert on duplicates update. Eg given a sales aggregate table with primary key(STORE,DATE):
INSERT INTO SALES VALUES(STORE,DATE,AMOUNT,TOTAL,COUNT) ON DUPLICATE KEY TOTAL = TOTAL+AMOUNT COUNT = COUNT+1;
The closing keynote was by Pasha Sadri from Yahoo! talking about Yahoo! Pipes. This is really cool. It lets you stitch together feeds and web services to make your own customised feeds.
The conference has been tiring but worth it. There are a lot of really smart people here and it’s been good to see how web 2.0 companies like Digg and Flickr have dealt with scaling.