It was the third day of the MySQL Users Conference and it was just as packed with good stuff as the days before.
Eben Moglen, Professor of Law and Legal History at Columbia University Law School and General Counsel of the Free Software Foundation gave a very thoughtful and interesting keynote entitled “Freedom Businesses Protect Privacy”. He talked about how the type of information stored has changed over the centuries and how the most powerful holders of information in the 21st century are private data mining organisations (using data voluntarily given) – not governments. He also made the interesting point that if Alice uses, say, gmail and Bob sends her an email then he too is being data mined even though it was Alice that shared her data.
His suggestion was to protect our private information by “storing it yourself” and then using freely available tools to mine your own data. I’m not sure I agree. Firstly, I don’t really want to store and analyse all my purchases at Amazon to determine new titles I might be interested in. That just isn’t going to happen. And secondly, don’t I really only accrue the benefit of data mining when thousands of other users’ data can be aggregated to determine trends? I’m not sure how useful a single person’s data is, especially to that person. Don’t get me wrong, I am all for protecting my private data but when you collect a sufficiently large sample of suitably anonymised data you can create services of real benefit to those who share their data. That said, I think the statistic of 1.6 CCTV cameras for every UK citizen is fucking obscene.
There were plenty of good sessions today too. This conference is really packed with smart people giving interesting talks.
Federation at flickr by Dathan Pattishall was a packed session. Flickr is different from a lot of the web apps being discussed here because it is quite write intensive. To deal with this they required a multi-master setup – something that is a little tricky with MySQL. They also needed no single point of failure and had to be able to serve pages FAST.
Federation refers to the storage engine. Basically it is another sharding setup. The main components of the setup are:
- The shards (slices of the main database) – 2 servers in master-master replication
- Global Ring or Lookup Ring which provides the ability for the app to access a shard. It is kind of like DNS.
- PHP logic to connect to the shards (10 lines of PHP code!)
The shards are setup in Active Master-Master Ring Replication. Because of this they needed to externalise the autoincrement IDs so they are globally unique. There are two servers using circular replication for this. These are called the “ticket” servers and they have a single row containing the max ID. They use the new 5.0 autoincrement offset stuff.
Users are assigned to a shard at random. This is quite balanced because they have so many new users. Occasionally they need to migrate users to balance data but this is quite easy. A user is always sent to a single machine during their session so they don’t get replication lag, ie not seeing changes immediately.
Each server in a shard is 50% loaded so they can shut down half the servers in a shard for maintenance.
When searching they need to combine results from shards so if one shard is slow the whole query is slow. To avoid this they use two search back ends. Most search is handled by proprietary yahoo search but some real time searches go to the shards.
The servers are 64bit RHEL4 16GB RAM 15K RPM RAID10 and their data size is about 12TB of meta data (not including the image data which is stored on a filer.) They use InnoDB file per table so space can be reclaimed from deletes. It is also easier for backups and data migration.
Database configuration:
- Max connections: 400 per server in shard (800 across both servers in a shard)
- Thread concurrency 32
- 5% of 12TB in memory so 13GB for innodb_buffer_pool
When doing a schema roll out they shutdown half the servers in a shard and force traffic to other server, do the ALTER TABLE, then bring it up and do the other server.
Both of the Amazon AWS sessions were scheduled at the same time. Luckily I went to the S3 storage engine first which ran short so I could sneak into the other session too.
Mark Atwood has created a new storage engine that uses Amazon’s S3 service. Unfortunately it is not yet a replacement for a general SQL engine but it is good for storing and serving big BLOBs like images. Simple selects showed very good response times. Here is how S3 translates to MySQL:
- AWS Account -> CREATE SERVER
- S3 Bucket -> Table
- S3 Item -> Row
- S3 Item Key -> VARCHAR PRIMARY KEY
- S3 Item Contents -> BLOB or VARCHAR
Here is how you would create a table:
CREATE TABLE 'bierce' ( 'word' VARCHAR(255) NOT NULL PRIMARY KEY, 'defn' BLOB) CHARSET=utf-8 ENGINE=AWSS3 CONNECTION='awss3 DevilDictionary $server MyAWSAcct';
There are some problems like:
- Not transactional
- No temporal guarantees – IE it could take some time for someone connected to a different AWS datacenter to see your latest insert
- Bad queries can cost real money. Full table scans are bad.
This engine can only be used with 5.1.
When Mark’s talk finished I went over to the session by Gururaja Rao from wipro on running cluster on Amazon EC2. I missed the first bit of this session but after speaking with him afterwards it looks like they have been very successful running cluster on EC2 (although only 5.0 not 5.1’s on disk NDB so far). His plan is to build a service that monitors the state of the cluster and automatically provision a new EC2 instance if one should fail. His architecture also involves running backups to S3 for disaster recovery. I asked him if there was a limit to the size of the cluster you can run and he didn’t seem to think there would be a problem running a 100 node cluster. Not sure how easy that would be to manage though…
This sounds really cool and would certainly be great for high availability. However, what would be even cooler is if this could be combined with the ability to automatically add nodes to the cluster depending on the server load. I think this would be harder though as I don’t think you can just add another node to MySQL without restarting the cluster. I think this is planned for 5.2.
There is a also the small chance that you have multiple members of your cluster on the same physical box. If you were to lose this machine then that could be bad for high availability…
This definitely looks promising although I don’t think it is something we would look at immediately as it adds quite a bit of complexity and I don’t think we would need that level of availability to begin with. Also, there is the small matter that this is just an idea and is not built yet. 🙂
I was chatting to fellow Aussie Adam Donnison at lunch yesterday and today I sat in on his session on running the MySQL websites. It was interesting to note the similarities of the mysql setup with our m3 infrastructure although he uses ldirectord/lvs for software load balancing (they used to use RR DNS!)
It was all fairly obvious stuff for a competent sysadmin though: separation of server functions, redundancy, minimising network traffic, handling failover with heartbeat. He did make a good point about redundancy though; that you should only pay for it where it is needed. So if you have designed your architecture to survive the loss of a machine, you don’t really need redundant power supplies or disks or anything on that box. You may as well just save the money.
In the future they plan to introduce a globally distributed setup; at the moment the servers are all in in one location. Listening to MySQL say they used to use RR DNS as load balancing and that they are planning a multi-site setup, and the amount of data that other sites like Digg (not flickr!) manage made me realise how sophisticated our M3 setup was (is). We had a global load balancing setup and managed a 1TB database driving 20 web servers.
Tobias Eggendorfer’s talk about German site Wundermedia was a bit weird. It was mostly about how they managed to scale their site with zero budget by writing their own caching engine and using some crazy jury-rigged replication scheme using SCP. Or maybe I dreamed that.
Andi Gutmans, who bears a striking resemblance to Mr Bean, gave a session on the state of PHP. Apparently 75% of new PHP apps are using PHP5. Another 75% statistic that was more surprising was the percentage of PHP developers who develop on Windows. He spent some time discussing the Zend Framework and I was interested to learn about its built in support for web services. Not just standard stuff like REST and SOAP but built-in support for Amazon, flickr, yahoo, etc. It also has full PDF read/write support. I would be interested to see how its API compares to the R&OS PDF class which is a bit nasty.
The reason they are moving towards an eclipse development platform is to gain tighter integration with Java. They support the Eclipse PDT project and v1.0 of this should be released in September.
Another really interesting PHP session was given by Mike Ho on creating web services in four lines using Qcodo. This is a really cool looking framework that has stuff like support for writing AJAX in PHP, ORM using code generation and some awesome WSDL generation abilities.
In his presentation he really did create web services that generated nasty big WSDL documents in a few lines of code. He created a simple web service class and then proceeded to access this from c# code running in Visual Studio in a Parallels desktop (an aside: there are a LOT of Macs here. At least half of the presentations were given on Macs). Here is an example:
require 'includes/prepend.inc.php'; class MyService extends QSoapService { /** * return string */ public function foo() { return 'foo'; } } MyService::run();
This code will generate the necessary WSDL to provide a SOAP service.
Ed Presz from Ticketmaster talked about how they use replication to scale. Some nice figures were:
- 14,000 tickets sold in a minute
- 500,000 tickets sold in a day
Again, replication works well for them because they have a high proportion of reads. They currently use MySQL 4.0 (they used to use MS SQL Server) but hope to move to 5.0 this year.
They use a large number of “chained” slaves. So a slave becomes a master to another slave. This gives them a lot of redundancy.
Interestingly they replicate across the Atlantic. Their servers in the UK log the transactions and then they are replicated across the pond for reporting via a 10mb WAN link. They standardise all their times in UTC.
Lots of people have been saying how setting innodb_flush_log_at_trx_commit=0 gives a huge improvement in performance – at most you can lose 1 sec of data. Usually this is worth it for the gain in speed.