This is a story about optimization in a MySQL database.
It happened when we built a system for email newsletters. Our system had to send tens of millions of emails per day. Sending an email is not an easy task, although the whole thing looks pretty primitive:
- Build the email from an HTML creative and substitute personalized data.
- Add a pixel to track opens, and replace all links in the email with our own to track clicks.
- Check before sending that the address is not on the blacklist.
- Send the email through a specific pool.
Let me tell you more about the second step:
The mail-builder microservice prepares an email for sending:
- finds all links in the email;
- generates a unique 32-character UUID for each link;
- replaces each original link with a new one and saves the mapping in the database.
Thus, every source link is replaced with a UUID, and the domain is changed to ours. When a GET request comes in on such a link, we proxy the original image or redirect to the original link. The data is saved in a MySQL database: the generated UUID together with the original link and some meta information (the user's email, the mailing id and other data). Denormalization helps us get everything we need in a single query, whether to record statistics or to start some kind of trigger chain.
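A minimal sketch of that step in Python (the `db.save_link` call, the tracking domain and the HTML handling here are illustrative assumptions, not our production code):

```python
import re
import uuid

TRACKING_DOMAIN = "https://click.example.com"  # assumed tracking domain

HREF_RE = re.compile(r'href="(https?://[^"]+)"')

def rewrite_links(html: str, user_email: str, mailing_id: int, db) -> str:
    """Replace every link in the email body with a tracking link."""
    def replace(match):
        original = match.group(1)
        link_id = uuid.uuid4().hex  # 32 hex characters
        # Denormalized row: everything needed for statistics or a
        # trigger chain can later be fetched with a single query.
        db.save_link(uuid=link_id, url=original,
                     email=user_email, mailing_id=mailing_id)
        return f'href="{TRACKING_DOMAIN}/{link_id}"'

    return HREF_RE.sub(replace, html)
```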
Problem number 1
The generation of our UUIDs depended on the timestamp.
Since mailings usually go out within a certain window of time and many instances of the mail-builder microservice run at once, it turned out that many of the UUIDs were very similar. Because the key values were so similar, working with the B-tree index was not very effective: this implicit detail noticeably reduced index performance.
We solved the problem with Python's uuid module, using random UUIDs with no time dependence.
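The difference is easy to see with the standard library (uuid1 is time-based, uuid4 is random; which time-based generator we had used originally does not matter here):

```python
import uuid

# Time-based: values generated on the same host within one mailing
# differ only in a few timestamp bits, so the keys cluster together.
print(uuid.uuid1().hex)
print(uuid.uuid1().hex)

# Random: 122 random bits, consecutive values share no common structure,
# so they spread evenly across the B-tree.
print(uuid.uuid4().hex)
print(uuid.uuid4().hex)
```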
What about the storage?
The table structure was as follows:
```sql
CREATE TABLE IF NOT EXISTS `Messages` (
  `UUID` varchar(32) NOT NULL,
  `Message` json NOT NULL,
  `Inserted` DATE NOT NULL,
  PRIMARY KEY (`UUID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
At the time of creation, everything looked logical:
UUID is the primary key, and in InnoDB the primary key is also the clustered index. When we select on this field, we read the record straight away, because all the column values are stored right there in the index. This was a deliberate decision.
Everything was wonderful until the table grew.
Problem number 2
If you read more about clustered indexes, you will find out about this nuance:
When a new row is added to the table, it is inserted not at the end of the file, not at the end of a flat list, but into the branch of the tree structure that corresponds to it by sort order.
Thus, as the table grows, insertion time grows too.
The solution was to use a different table structure.
```sql
CREATE TABLE IF NOT EXISTS `Messages` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `UUID` varchar(32) NOT NULL,
  `Message` json NOT NULL,
  `Inserted` DATE NOT NULL,
  PRIMARY KEY (`ID`),
  UNIQUE KEY `UUID` (`UUID`, `Inserted`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
Since the primary key is now auto-incrementing, and MySQL caches the last insertion position, inserts now always happen at the end: InnoDB is optimized for writing sequentially increasing values.
(I found the details of this optimization in the Postgres source code; MySQL implements a very similar one.)
Of course, we had to add a unique key on top, but in exchange we increased the insertion speed.
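A rough way to verify the effect on your own hardware (a sketch assuming the `mysql-connector-python` driver and a copy of the old schema named `MessagesUUID`; both names are illustrative):

```python
import time
import uuid
import mysql.connector  # assumed driver; any DB-API-compatible one will do

def time_inserts(cursor, sql, rows):
    start = time.monotonic()
    cursor.executemany(sql, rows)
    return time.monotonic() - start

conn = mysql.connector.connect(user="test", password="test", database="test")
cur = conn.cursor()

rows = [(uuid.uuid4().hex, '{"k": 1}', "2020-01-01") for _ in range(100_000)]

# Old schema: random UUID as the clustered primary key.
t_uuid = time_inserts(cur,
    "INSERT INTO MessagesUUID (UUID, Message, Inserted) VALUES (%s, %s, %s)",
    rows)

# New schema: AUTO_INCREMENT primary key, UUID in a secondary unique key.
t_auto = time_inserts(cur,
    "INSERT INTO Messages (UUID, Message, Inserted) VALUES (%s, %s, %s)",
    rows)
conn.commit()

print(f"uuid pk: {t_uuid:.2f}s, auto_increment pk: {t_auto:.2f}s")
```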
With even greater growth of the database, we started thinking about deleting old data. Running DELETE on the Inserted field is absolutely not optimal: it takes a very long time, and the space is not freed until you execute OPTIMIZE TABLE. By the way, that operation completely locks the table, which did not suit us at all.
Therefore, we decided to split our table into partitions.
One day, one partition; old partitions are dropped automatically when their time comes.
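One nuance the MySQL docs insist on: the partitioning column must be part of every unique key, so both the primary key and the unique key from the table above have to include Inserted. With that in place, the daily rotation can be a trivial cron job; a sketch (the partition naming, retention period and DDL are assumptions based on the description, not our exact migration):

```python
import datetime

def rotate_partitions(cursor, keep_days: int = 30) -> None:
    """Daily job: create tomorrow's partition, drop the oldest one."""
    today = datetime.date.today()
    tomorrow = today + datetime.timedelta(days=1)
    oldest = today - datetime.timedelta(days=keep_days)

    # The table is assumed to be partitioned with
    # PARTITION BY RANGE (TO_DAYS(Inserted)): one day, one partition.
    cursor.execute(
        "ALTER TABLE Messages ADD PARTITION (PARTITION p{:%Y%m%d} "
        "VALUES LESS THAN (TO_DAYS('{}')))".format(
            tomorrow, (tomorrow + datetime.timedelta(days=1)).isoformat()))

    # DROP PARTITION releases the space immediately,
    # unlike DELETE followed by OPTIMIZE TABLE.
    cursor.execute(
        "ALTER TABLE Messages DROP PARTITION p{:%Y%m%d}".format(oldest))
```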
Problem number 3
We gained the ability to delete old data, but we did not gain the ability to select from the right partition: in a select we specify only the UUID, so MySQL does not know which partition to look in and searches all of them.
The solution was born from Problem number 1: add a timestamp to the generated UUID. Only this time we did it a little differently: we inserted the timestamp at a random position in the string, not at the beginning or the end; a dash was added before and after it, so it can be extracted with a regular expression.
With this optimization, we can recover the date when a UUID was generated and issue the select with a concrete value of the Inserted field. Now we read the data directly from the partition we need.
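A minimal sketch of the trick (the exact timestamp format is an assumption; what matters is that hex contains no dashes of its own, so the delimiters are unambiguous):

```python
import random
import re
import time
import uuid

TS_RE = re.compile(r"-(\d+)-")

def make_link_id() -> str:
    """uuid4 hex with the Unix timestamp spliced in at a random offset."""
    base = uuid.uuid4().hex                 # 32 hex chars, no dashes
    pos = random.randint(1, len(base) - 1)  # neither the start nor the end
    # Note: the result is longer than 32 chars, so the column grows too.
    return f"{base[:pos]}-{int(time.time())}-{base[pos:]}"

def inserted_date(link_id: str) -> str:
    """Recover the generation date so the SELECT hits a single partition."""
    ts = int(TS_RE.search(link_id).group(1))
    return time.strftime("%Y-%m-%d", time.gmtime(ts))

link_id = make_link_id()
# SELECT ... WHERE UUID = %s AND Inserted = %s  -- partition pruning applies
print(link_id, inserted_date(link_id))
```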
Also, thanks to things like ROW_FORMAT=COMPRESSED and changing the encoding to latin1, we saved even more space on disk.
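For completeness, the corresponding migration might look like this (a sketch: COMPRESSED requires innodb_file_per_table, and latin1 is only safe if the stored strings actually fit it; the JSON column keeps its own internal encoding regardless):

```python
# Hypothetical one-off migration, run during a maintenance window.
cursor.execute("ALTER TABLE Messages ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8")
cursor.execute("ALTER TABLE Messages CONVERT TO CHARACTER SET latin1")
```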