MySQL Users Conference 2007 – Day 2

Today was the first day of the main conference and at breakfast I met Daniel Bernholc from TradeDoubler’s Stockholm office. It was interesting to hear about the TradeDoubler architecture: they do 25,000 requests per second using only about 8 Sun Opteron web servers. All their application code is written in Java and they use a custom-written Apache module for delivery.

Guy Kawasaki gave a very entertaining keynote called “The Art of Innovation” and Brian Stevens from Red Hat talked about the OLPC project.

The first session I went to was Rasmus talking about PHP performance and security. There is a new MySQL driver for PHP called mysqlnd (MySQL Native Driver) which should, in theory, give better performance, but it seems that in real applications this is not the case.

Rasmus used an example of a single Drupal page view to demonstrate performance tuning. Like Domas Mituzas, he recommends using xdebug with kcachegrind for profiling. He also demonstrated using callgrind, running via valgrind, to provide input to the kcachegrind visualizer. He started Apache under valgrind:

# run Apache in single-process mode (-X) under valgrind's callgrind tool
valgrind --tool=callgrind --dump-instr=yes -v /usr/sbin/httpd -X
# make the dump (XXXX is the process id) readable so kcachegrind can load it
chmod a+rw callgrind.out.XXXX

A good tip for analysing kcachegrind output: lots of boxes is generally a good thing. It means your processing is “distributed” and no single part of your code is taking all the CPU time.

Also like Domas, Rasmus recommends the APC opcode cache.

The second part of Rasmus’s session was on PHP security. I found this very interesting as he went over the different types of XSS attacks and how to avoid them. You can check whether a page is vulnerable to XSS by seeing if you can get it to run alert(); if you can run an alert() you can run any JavaScript in the user’s browser. Basically, to avoid XSS problems you just need to make sure you never display user data back to the user unfiltered. To help with this there is the filter PHP extension, which comes as standard in PHP 5.2 and can be used with 5.0/5.1 via PECL. XSS vulnerabilities are very common on the web and he showed us several examples of vulnerable pages, including the IBM developerWorks article on how to prevent XSS attacks!
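
To make that concrete, here’s a quick sketch of my own (not code from the talk) showing output escaping plus the filter extension; the name parameter and the page itself are made up:

<?php
// Dangerous: reflects raw user input, so ?name=<script>alert(1)</script>
// would run JavaScript in the victim's browser.
// echo 'Hello ' . $_GET['name'];

// Safe: escape on output so any markup is displayed as text, not executed.
echo 'Hello ' . htmlspecialchars($_GET['name'], ENT_QUOTES, 'UTF-8');

// The filter extension can also sanitise the value on the way in.
$name = filter_input(INPUT_GET, 'name', FILTER_SANITIZE_SPECIAL_CHARS);
echo 'Hello ' . $name;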

Rasmus then moved on to Cross-Site Request Forgery (XSRF). Basically, this is when data is submitted to your page from an unauthorised source, for example a form hosted on another site. The way around this kind of attack is to use what Rasmus calls “crumbs”: bits of user-specific data embedded in the submission form and checked when the form comes back.
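
As a rough illustration of the crumb idea (my own sketch with made-up file names and a simple md5-based token, not Rasmus’s actual code):

<?php
// form.php (hypothetical): generate a user-specific crumb and embed it in the form.
session_start();
$_SESSION['crumb'] = md5(uniqid(mt_rand(), true));
?>
<form method="post" action="update.php">
  <input type="hidden" name="crumb" value="<?php echo $_SESSION['crumb']; ?>">
  <input type="text" name="email">
  <input type="submit" value="Save">
</form>

<?php
// update.php (hypothetical): refuse any submission whose crumb does not match
// the one stored in this user's session.
session_start();
if (!isset($_POST['crumb']) || $_POST['crumb'] !== $_SESSION['crumb']) {
    die('Possible XSRF: form was not submitted from our own page.');
}
// ...process the update as normal...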

The slides from his talk are here.

The session by two guys from digg.com about their architecture was interesting. They scale by having many web servers behind a load balancer talking to many MySQL slave instances; each web server opens connections to a random DB server. A memcached server farm is used to cache only chunks of content, not whole pages. A lot of people seem to be talking about memcached at this conference and it is something I had heard of but never really looked at. It looks pretty neat, but most of the people talking about it here have far more reads than writes to their database so I’m not sure how useful it would be for us.
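
For anyone else who hasn’t looked at it, the basic read-through pattern with the PECL memcache extension looks something like this (my own sketch; the key name, helper function and story id are invented, and a real setup would spread keys across the whole server farm):

<?php
// Cache one chunk of content (not a whole page) for five minutes.
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$key = 'story_summary_' . $storyId;              // $storyId assumed to exist
$summary = $memcache->get($key);

if ($summary === false) {
    // Cache miss: fall back to the database, then store the chunk.
    $summary = fetch_story_summary_from_db($storyId);   // hypothetical helper
    $memcache->set($key, $summary, 0, 300);
}
echo $summary;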

Sharding is another technique that a lot of people seem to be implementing. Sharding is basically partitioning your database over a number of machines, breaking one big DB into smaller ones. But unlike, say, MySQL Cluster, you need to manage the shards yourself in your application code. This can give performance improvements but it comes at the expense of more complicated code, and you can’t do SQL joins across shards. Those are some pretty big downsides so it won’t suit everyone.
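
To make the “manage it yourself” point concrete, the application code ends up doing something like this (my own sketch; the host names, shard count and user id are invented):

<?php
// Map a user id onto one of four shards; all of this user's rows live there.
$shards = array(
    0 => array('host' => 'db1.example.com', 'name' => 'app_shard_0'),
    1 => array('host' => 'db2.example.com', 'name' => 'app_shard_1'),
    2 => array('host' => 'db3.example.com', 'name' => 'app_shard_2'),
    3 => array('host' => 'db4.example.com', 'name' => 'app_shard_3'),
);

function shard_for_user($userId, array $shards) {
    return $shards[$userId % count($shards)];
}

$shard = shard_for_user($userId, $shards);       // $userId assumed to exist
$db = mysqli_connect($shard['host'], 'user', 'pass', $shard['name']);
// Queries for this user now go to a single shard; anything that spans users
// (reports, joins) has to be stitched together in PHP instead of SQL.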

Again, they recommend using the APC opcode cache. I was surprised to hear that Digg only have about 30GB of data.

The SQL Antipatterns talk by Bill Karwin from Zend was also good. It was a quick run-through of common mistakes people make with their SQL:

Comma-separated lists
This means storing a list of values in a single varchar column, which has a number of problems:

  • You can’t enforce referential integrity
  • It is hard to add a value to a list
  • It is difficult to search
  • You can’t even be sure the values in the list are integers

If you have this problem you probably have a many-to-many relationship, which can be solved using an intersection table, as in the sketch below.
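
Here is what that looks like, using made-up products and accounts tables and PDO:

<?php
// Instead of products.account_ids holding a string like '12,34,56', store one
// row per (product, account) pair in an intersection table.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$pdo->exec("
    CREATE TABLE product_accounts (
        product_id INT NOT NULL,
        account_id INT NOT NULL,
        PRIMARY KEY (product_id, account_id),
        FOREIGN KEY (product_id) REFERENCES products (product_id),
        FOREIGN KEY (account_id) REFERENCES accounts (account_id)
    ) ENGINE=InnoDB
");

// Adding a value to the 'list' is now a plain insert...
$pdo->exec('INSERT INTO product_accounts (product_id, account_id) VALUES (1, 34)');

// ...and searching is an indexed join instead of string matching.
$stmt = $pdo->prepare(
    'SELECT p.*
     FROM products p
     JOIN product_accounts pa ON pa.product_id = p.product_id
     WHERE pa.account_id = ?'
);
$stmt->execute(array(34));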

Ambiguous GROUP BY
This is the old problem of MySQL not enforcing the SQL standard for GROUP BY. You can use non-aggregated columns or calculations in the SELECT list that do not appear in the GROUP BY clause. This can cause some unexpected results, as the manual says:

Do not use this feature if the columns you omit from the GROUP BY part are not constant in the group. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.

You can solve this problem by either following the SQL standard or using the GROUP_CONCAT function.
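
A quick sketch of both fixes, using a made-up comments table:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// Ambiguous: author is neither aggregated nor in the GROUP BY, so MySQL may
// return any author from each group.
// SELECT thread_id, author, COUNT(*) FROM comments GROUP BY thread_id

// Standard-conforming: every non-aggregated column appears in the GROUP BY.
$standard = $pdo->query(
    'SELECT thread_id, author, COUNT(*) AS num
     FROM comments
     GROUP BY thread_id, author'
);

// Or collapse the ambiguous column into one value per group with GROUP_CONCAT.
$concat = $pdo->query(
    'SELECT thread_id, GROUP_CONCAT(DISTINCT author) AS authors, COUNT(*) AS num
     FROM comments
     GROUP BY thread_id'
);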

Entity-Attribute-Value Tables
This is a common way to store things like user attributes or preferences but there are some problems with this approach:

  • There is no way to declare an attribute to be NOT NULL
  • You cannot enforce referential integrity
  • It is expensive to reconstruct the attributes in a single row

These problems can be solved by:

  • Only using it as a last resort
  • Enforcing constraints on attribute values in the application logic
  • Not fetching all the data as a single row, which causes big joins
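
The “expensive to reconstruct” point is easiest to see in the query itself; here is a sketch of my own against a made-up user_attributes table:

<?php
// EAV table: user_attributes(user_id, attr_name, attr_value). Rebuilding one
// row of attributes means pivoting with one CASE expression per attribute.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$stmt = $pdo->prepare("
    SELECT u.user_id,
           MAX(CASE WHEN a.attr_name = 'timezone' THEN a.attr_value END) AS timezone,
           MAX(CASE WHEN a.attr_name = 'language' THEN a.attr_value END) AS language,
           MAX(CASE WHEN a.attr_name = 'theme'    THEN a.attr_value END) AS theme
    FROM users u
    LEFT JOIN user_attributes a ON a.user_id = u.user_id
    WHERE u.user_id = ?
    GROUP BY u.user_id
");
$stmt->execute(array(42));
$prefs = $stmt->fetch(PDO::FETCH_ASSOC);
// Fetching the attribute rows as-is and assembling them in the application is
// usually cheaper than making the database do this pivot.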

Other tips were not letting users run their own ad-hoc SQL queries on the DB(!) and avoiding SQL injection by using prepared statements, where values are bound after the SQL has been parsed.
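
For completeness, the prepared-statement version with PDO looks like this (my own sketch; the table and connection details are made up):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
// Ask for real server-side prepares rather than client-side emulation.
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);

// Injection-prone: user input spliced straight into the SQL string.
// $pdo->query("SELECT * FROM users WHERE name = '" . $_GET['name'] . "'");

// Prepared statement: the SQL is parsed first, then the value is bound, so the
// input can never change the structure of the query.
$stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
$stmt->execute(array($_GET['name']));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);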

The performance tuning sessions are always popular and Peter Zaitsev is a smart guy. His session on server performance tuning was standing room only. He explained that you can get the most benefit out of tuning just a few parameters like innodb_buffer_pool_size.

Another interesting session was given by Farhan Mashraqi from Fotolog. According to Alexa this is one of the top 25 sites on the web, with millions of members. It is different from something like Flickr because users may only upload a single image per day (Gold members get 6).

Their site runs on about 30 Solaris servers divided into four clusters (there’s that sharding again). Each shard has a number of tables that were originally partitioned by username, but this gave an uneven distribution (which seems obvious if you think about it). They use Akamai, and I was interested to hear that they rewrote their entire site from PHP into Java. The reason the CTO gave was that the developers hated PHP…

Originally their site ran on MyISAM tables, which predictably caused concurrency problems as their traffic grew, but I was interested to hear that InnoDB also takes a table-level lock when inserting into a table with an auto-increment primary key.

Farhan then went into some detail about how they managed to lower their I/O by adding columns to, and reordering, their primary key. I want to go over his presentation again when it is posted on the conference site, as he seemed to gain a huge performance increase from this, but I suspect it is quite specific to their data.
