The success story of "Yandex.Mail" with PostgreSQL



Vladimir Borodin (dev1ant on Habr), a system administrator in the data storage systems operations group of Yandex.Mail, describes the complexities of a major migration project from Oracle Database to PostgreSQL. This is a transcript of his talk at the HighLoad++ 2016 conference.

Hello! My name is Vladimir, and today I am going to talk about the databases behind Yandex.Mail.

First, a few facts that will matter later. Yandex.Mail is a fairly old service: it launched in 2000, so we have accumulated a lot of legacy. It is, as it is customary and fashionable to say, quite a highload service: more than 10 million users a day and a few hundred million in total. At peak, more than 200 thousand requests per second arrive at the backend. We store more than 150 million messages a day that have passed the spam and virus checks. The total volume of messages over 16 years is more than 20 petabytes.

What will this talk be about? About how we moved the metadata from Oracle to PostgreSQL. The metadata is not petabytes: it is a little over three hundred terabytes. More than 250 thousand queries per second hit these databases. Keep in mind that these are small OLTP queries, mostly reads (80%).

This was not our first attempt to get rid of Oracle. In the beginning there was an attempt to move to MySQL; it failed. In 2007 or 2008 there was an attempt to write something of our own; it failed too. In both cases the failure was not so much for technical reasons as for organizational ones.

What is the metadata? Here it is highlighted with arrows. It is the folders, which form a hierarchy with counters; the labels (also, in effect, lists with counters); the collectors; the threads; and, of course, the messages themselves.



We do not keep the messages themselves in the metadata databases; the message bodies live in separate storage. The metadata databases store the envelopes. An envelope is a subset of the email headers: from, to, subject, date and things like that. We also store information about attachments and about the messages.




Back in 2012


All of this lived in Oracle. We had a lot of logic stored in the database itself. The Oracle databases squeezed the most out of the hardware: we packed a lot of data onto each shard, more than 10 terabytes. Roughly speaking, with 30 cores our normal working load average was around 100. That is not when everything is on fire; that is normal operation.

There were few databases, so a lot was done by hand, without automation; there were many manual operations. To save money we split the databases into "warm" (75%) and "cold" (25%). "Warm" ones are for active users and sit on SSDs. "Cold" ones are for inactive users and sit on SATA drives.

Sharding and fault tolerance are important topics at Yandex. Sharding, because you cannot cram everything into one shard; fault tolerance, because we regularly take one of our data centers offline and back to check that everything keeps working.

How was this implemented? We have an internal service called BlackBox. When a request arrives at one of our backends, the backend takes the authentication data (a login and password, a cookie, a token or something like that) and exchanges it with BlackBox, which, on success, returns the user ID and the name of the shard.



The backend then fed that shard name to the Oracle driver, OCCI, and the whole fault-tolerance logic was implemented inside this driver. Roughly speaking, a special file, /etc/tnsnames.ora, recorded each shard name and the list of hosts that serve that shard. Oracle decided which of them was the master, which was a replica, who was alive, who was dead, and so on. In total, sharding was implemented with external tools, and fault tolerance by means of the Oracle driver.
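
For illustration, a shard entry in tnsnames.ora of that kind looks roughly like this (the shard and host names below are made up, not our real ones):

    # /etc/tnsnames.ora: one entry per shard, a service name plus the hosts serving it
    MAILSHARD17 =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = maildb17a.example.net)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = maildb17b.example.net)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = mailshard17)
        )
      )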
A large part of the backend was written in C++. To avoid reinventing the wheel, those backends have long had a common abstraction for metadata access called macs. It is simply an abstraction for talking to the database. For almost all of that time it had a single implementation, macs_ora, for going directly to Oracle. At the very bottom, of course, sits OCCI. There was also a small layer, dbpool, which implemented a connection pool.



That is how it was designed and implemented once upon a time. Over time the abstraction leaked: the backends started using methods specific to the macs_ora implementation, or even reaching straight into dbpool. Then Java and all sorts of other backends appeared that could not use this library at all. All that spaghetti later had to be painfully untangled.

Oracle is a great database, but it had its problems. For example, deploying PL/SQL code is painful because of the library cache: if the database is under load, you cannot simply update the code of a function that is currently being used by some sessions.

The other problems were connected not so much with Oracle as with the approach we used. Again: lots of manual operations. Switching masters, provisioning new databases, running user transfers: everything was done by hand, because there were few databases.

From the development point of view, there is the drawback that the [C++] Oracle driver has only a synchronous interface, so you cannot build a normal asynchronous backend on top of it. That caused some pain in development. The second pain was that spinning up a test database is problematic: first, because it is done by hand, and second, because it costs money.

Whatever anyone says, Oracle does have support, although support for enterprise companies is often far from ideal. But the main reason for the switch was money: Oracle is expensive.


Timeline


In October 2012, more than four years ago, it was decided that we had to get rid of Oracle. The word PostgreSQL was not even mentioned, and no technical details were discussed; it was a purely political decision: get rid of it within three years.

Six months later we began the first experiments. What those six months were spent on, I can tell you later; those six months were important.

In April 2013 we experimented with PostgreSQL. At the time there was a fashionable trend toward all kinds of NoSQL solutions, and we tried various things. We also remembered that all the metadata is already stored in the backend of mail search, and maybe that could be used; we tried that option too.

The first successful experiment was with the collectors; I spoke about it at a meetup at Yandex in 2014.

We took a small (2 terabytes) but rather loaded (40 thousand requests per second) piece of the mail metadata, one not very tightly coupled with the core metadata, and moved it from Oracle to PostgreSQL. We did it, we liked it, and we decided that PostgreSQL was our choice.

Next, we threw together a prototype of the mail schema for PostgreSQL and began pouring the whole incoming stream of mail into it. We did it asynchronously: all 150 million messages a day were piled into PostgreSQL. If the shim failed, we did not care. It was a pure experiment and it did not touch production.

This let us test our initial hypotheses about the schema. Having data you do not mind throwing away is very convenient: we would come up with a schema, stuff it with messages, see that it did not work, drop everything and start over. We love data that can be dropped.

This also let us, to some extent, load-test directly under live load rather than some synthetic load on separate test stands. We were able to make initial estimates of the hardware PostgreSQL would need. And, of course, experience: the main purpose of the earlier experiment and of the prototype was experience.
Then the main work began. Development took about a calendar year. Long before it was finished, we had moved our own mailboxes from Oracle to PostgreSQL. We always knew there would never be a night when we put up a "sorry, maintenance in progress" page, carry over 300 terabytes and come back up on PostgreSQL. That does not happen: something would break, we would roll back, and everything would be bad. We understood there would be a fairly long period when some mailboxes lived in Oracle and some in PostgreSQL, with a slow migration in between.

In the summer of 2015 we moved our own mailboxes. The Mail team, the people who write, test and administer it, moved their own mailboxes. This greatly accelerated development. When some abstract user suffers, that is one thing; when you suffer yourself and can fix it, that is quite another.

Even before we had finished implementing all the features, we started migrating inactive users. We call a user inactive if mail arrives for them and we file the messages, but they never read them: not on the web, not in the mobile apps, not over IMAP; they are simply not interested. There are such users, unfortunately. We started migrating these inactive users while, for example, IMAP was not yet fully implemented and half of the endpoints in the mobile app did not work yet.

That was not because we were brave enough to risk breaking everyone's mailboxes, but because we had a plan B in the form of reverse migration, and it helped us a lot. There was even automation for it: if we had moved a user and they suddenly tried, for example, to log in to the web interface, that is, woke up and became active, we moved them back to Oracle so as not to break any features for them. This let us fix a pile of bugs in the transfer code.

Then came the migration itself. Here are some interesting facts: it took us 10 man-years to rewrite all the spaghetti we had accumulated over 12-15 years.

The migration itself went very fast. This chart covers 4 months. Each line is the percentage of load that a given service serves from PostgreSQL, broken down by service: IMAP, web, POP3, the shim, mobile and so on.



Unfortunately, you cannot simply jump over the gap from 95%. We could not move everyone by April because registration remained in Oracle, and it is quite a complex mechanism. For a while it worked like this: we registered new users in Oracle and moved them to PostgreSQL that same night. In May we reworked registration, and in July we shut down all the Oracle databases.




Main changes


Our abstraction gained another implementation, macs_pg, and we untangled all the spaghetti: all those leaked abstractions had to be carefully rewritten. At the bottom it has libpq, plus another small layer, apq, which implements connection pooling, timeouts and error handling, all of it asynchronously.



Sharding and fault tolerance work in much the same way. The backend receives credentials from the user and exchanges them with BlackBox for the user ID and the shard name. If the shard name has a pg prefix, the backend makes another request to a new service we call Sharpei. The backend passes it the user ID and the mode in which it wants the database, for example "I want the master", "I want a synchronous replica" or "I want the nearest host". Sharpei returns a connection string. The backend then opens a connection, holds it and uses it.

To know who is the master, who is a replica, who is alive, who is dead, who is lagging and who is not, Sharpei polls the target databases once a second and asks for their status. So at this point there is one component that took on both functions: sharding and fault tolerance.
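
The talk does not go into which exact queries Sharpei runs, but such a poller can get by with standard PostgreSQL facilities; a minimal sketch of the kind of checks involved:

    -- Is this host currently a replica (true) or the master (false)?
    SELECT pg_is_in_recovery();

    -- On a replica: how far behind is WAL replay?
    SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;

    -- On the master: which replicas are connected and whether they are synchronous.
    SELECT client_addr, state, sync_state FROM pg_stat_replication;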



In terms of hardware we made a few changes. Since Oracle is licensed per processor core, we were forced to scale vertically: per CPU core we packed a lot of memory and fast SSDs. So there was a small number of databases with a small number of cores but with huge amounts of memory and disk. We always had strictly one replica for failover, because every additional one is money.
With PostgreSQL we changed the approach. We began to make the databases smaller, with two replicas in each shard. This lets us direct read load to the replicas: in Oracle everything was served from the master, while with PostgreSQL there are three smaller machines instead of two big ones and reads are sent to the replicas. With Oracle we scaled vertically; with PostgreSQL we scale horizontally.



In addition to the "warm" and "cold" databases, "hot" ones appeared. Why? Because we suddenly discovered that 2% of the active users create 50% of the load. There are heavy users who really hammer us, and we made separate databases for them. They do not differ much from the warm ones, they also use SSDs, but there is less disk per CPU core, because the CPU is used much more actively there.

Of course, we built automation for migrating users between shards. For example, if an inactive user currently living on a SATA database starts using IMAP, we move them to a "warm" database. Or if a user on a warm database shows no activity for six months, we move them to a "cold" one.

Moving the old mail of active users from SSD to SATA is something we really want to do but cannot yet. If you are an active user living on SSD and you have 10 million messages, they all sit on that SSD, which is not very efficient. But for now PostgreSQL has no proper partitioning.
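
For context, declarative partitioning appeared later, in PostgreSQL 10, after this talk; a rough sketch of the kind of SSD-to-SATA layout described above, with made-up names and paths:

    -- Two tablespaces on different storage tiers (the directories must already exist).
    CREATE TABLESPACE ssd_fast  LOCATION '/ssd/pgdata';
    CREATE TABLESPACE sata_cold LOCATION '/sata/pgdata';

    -- Messages partitioned by age; old partitions sit on the SATA tablespace.
    CREATE TABLE archive_box (
        uid      bigint      NOT NULL,
        mid      bigint      NOT NULL,
        received timestamptz NOT NULL
    ) PARTITION BY RANGE (received);

    CREATE TABLE archive_box_2016 PARTITION OF archive_box
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01') TABLESPACE ssd_fast;
    CREATE TABLE archive_box_2015 PARTITION OF archive_box
        FOR VALUES FROM ('2015-01-01') TO ('2016-01-01') TABLESPACE sata_cold;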

We changed all the identifiers. In the Oracle setup they were globally unique. We had a separate database that recorded which ranges lived in which shard. Of course, we once had a failure when, because of a mistake, the identifiers overlapped, and a good half of everything was tied to their uniqueness. That hurt.

With PostgreSQL we decided to switch to a new scheme in which identifiers are unique within a single user. Where previously the message identifier mid was unique on its own, now the pair of uid and mid is unique. In all our tables the first field is uid; everything is prefixed with it, and it is always part of the key.

Besides taking less space, this has another obvious advantage. Since all these identifiers are drawn from sequences, there used to be contention for the last page of the index; in Oracle we dealt with that problem using reverse key indexes. With PostgreSQL, because the inserts go to different pages of the index, we use an ordinary B-Tree, we get range scans, and all of one user's data sits adjacent in the index. That is very convenient.

We introduced revisions for all entities. This allows us, first, to read data from the replicas without getting stale results and, second, to do incremental updates for IMAP and mobile. The answer to the question "what changed in this folder since such-and-such revision" becomes much simpler because of it.
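
A minimal sketch of what per-user keys plus revisions can look like; the table and column names here are illustrative, not our production schema:

    CREATE SEQUENCE mid_seq;

    CREATE TABLE messages (
        uid      bigint  NOT NULL,                             -- user id, the leading column everywhere
        mid      bigint  NOT NULL DEFAULT nextval('mid_seq'),  -- only has to be unique within the user
        fid      integer NOT NULL,                             -- folder id
        revision bigint  NOT NULL,                             -- bumped whenever the row changes
        PRIMARY KEY (uid, mid)                                 -- plain B-Tree; different users insert
                                                               -- into different index pages
    );

    -- All of one user's rows are adjacent in the index, so this is a simple range scan:
    SELECT mid FROM messages WHERE uid = 123 ORDER BY mid;

    -- Incremental sync for IMAP/mobile: "what changed in this folder after revision 41000":
    SELECT mid, revision FROM messages
     WHERE uid = 123 AND fid = 7 AND revision > 41000;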

PostgreSQL is very good with arrays and composite types, so we denormalized some of the data. Here is one example:



This is our main table, mail.box. It contains one row per message; its primary key is the pair of uid and mid. There is also an array of labels, lids, because one message can carry more than one label. Now there is the task of answering the question "give me all messages with such-and-such label". Obviously, that needs some kind of index, and a B-Tree index built on the array will not answer this question. So we use a tricky functional GIN index over the uid and lids fields. It lets us answer the question "give me all messages of this user that have such-and-such label".
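
The exact functional GIN index is not shown in the talk; as one possible way to answer the same kind of query, here is a sketch that uses the btree_gin extension to combine the scalar uid with the lids array (a simplified stand-in for mail.box, not the real schema):

    CREATE TABLE box (
        uid  bigint    NOT NULL,
        mid  bigint    NOT NULL,
        lids integer[] NOT NULL DEFAULT '{}',   -- label ids attached to the message
        PRIMARY KEY (uid, mid)
    );

    -- btree_gin provides GIN operator classes for scalar types such as uid.
    CREATE EXTENSION IF NOT EXISTS btree_gin;
    CREATE INDEX box_uid_lids_gin ON box USING gin (uid, lids);

    -- "give me all of this user's messages carrying label 42"
    SELECT mid FROM box WHERE uid = 123 AND lids @> ARRAY[42];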


Stored logic


  • Because stored logic was so painful in Oracle, we had sworn that in PostgreSQL there would be no stored logic at all, none whatsoever. But during our experiments and prototyping we realized that PL/pgSQL is very good: it does not suffer from the library cache problem, and we did not find many other critical issues with it.
  • Since there is no undo, the cost of a mistake is much higher. We had to dig into undo a couple of times after bad code; my colleague Alexander gave a separate talk about that at one of our meetups.
  • Because there is no library cache, deploying is much simpler. We now roll out a couple of times a week instead of once a quarter, as it was before (a small sketch follows below).
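
The sketch mentioned above is a toy example, not a function from our code: with no library cache, rolling out new stored code is just running CREATE OR REPLACE on the live database.

    CREATE OR REPLACE FUNCTION add_days(d date, n integer)
    RETURNS date
    LANGUAGE plpgsql IMMUTABLE AS $$
    BEGIN
        -- to deploy a new version, simply run CREATE OR REPLACE again with the new body
        RETURN d + n;
    END;
    $$;

    SELECT add_days(current_date, 7);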


Maintenance


  • Because we changed the hardware and began to scale horizontally, we also changed our approach to managing the databases. We now manage them with SaltStack. Its main killer feature for us is the ability to see a detailed diff between what is currently on a database host and what is expected to be there. If the operator is happy with it, they press "roll out" and it gets applied (see the sketch after this list).
  • We now change the schema and the code through migrations. We had a separate talk about that.
  • We moved away from manual maintenance and automated everything we could. Switching masters, transferring users, provisioning new shards and so on: all of it is a button press and very simple.
  • Because deploying a new database is a single button press, we got a representative test environment for development. Every developer can have a database of their own, or even two, however they like; it is very convenient.
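
The sketch referred to in the first item: one way to get such a dry-run diff with SaltStack is its test mode, which reports what would change on the targeted hosts without applying anything (the host pattern here is made up):

    salt 'maildb*' state.apply test=True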


Problems


Things like this never go completely smoothly.

Here is a list of threads in the community with problems that we could not solve on our own.

  • Problem with ExclusiveLock on inserts
  • Checkpoint distribution
  • ExclusiveLock on extension of relation with huge shared_buffers
  • Hanging the startup process on the replica after vacuuming on the master
  • Replication slots and isolation levels
  • Segfault in BackendIdGetTransactions
So we went to the community, and it helped us. It was a good test of what to do when you have no enterprise support: there is the community, and it works, and that is great. And of course, far more problems we solved on our own.

For example, we had a very popular joke: "In any unclear situation, blame autovacuum." Those problems we also sorted out.

We lacked ways to diagnose PostgreSQL. The folks from Postgres Pro built a wait interface for us. I spoke about it at PG Day 2015 in St. Petersburg; you can read there how it works. With the help of the people from Postgres Pro and EnterpriseDB it made it into the 9.6 core. Not all of it, but part of those developments went into 9.6, and this functionality will keep improving. In 9.6 columns appeared that make it much easier to understand what is happening in the database.
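
The columns in question are wait_event_type and wait_event in pg_stat_activity (since 9.6); a quick way to see what sessions are waiting on right now:

    SELECT pid, state, wait_event_type, wait_event, query
      FROM pg_stat_activity
     WHERE wait_event IS NOT NULL;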

A surprise: we ran into a problem with backups. We have a recovery window of 7 days, that is, we must be able to recover to any point in time within the last 7 days. In Oracle, the space needed for all the backups and archive logs was roughly equal to the database size: a 15-terabyte database takes about 15 terabytes for 7 days of backups.

With PostgreSQL we use barman, and the backups need at least 5 times the size of the database. The WAL is compressed, but the backups themselves are not, there are file-level increments that do not really work, and in general everything is single-threaded and very slow. If we backed up our 300 terabytes of metadata as is, we would need roughly 2 petabytes for backups. Let me remind you that the whole Mail storage is 20 petabytes. That means we would have to carve out 10% of it just for backups of the metadata databases for the last 7 days, which is a pretty bad plan.

We did not come up with anything better, so we patched barman and sent a pull request. Almost a year has passed since we asked them to merge this killer feature, and they are asking us for money to merge it. Very cheeky guys. My colleague Evgeny, who did all that work, spoke about it at PG Day 2016. The patched barman really compresses backups much better, speeds them up, and has honest increments.

Given the experience of the experiment, the prototype, and other databases that had by then moved to PostgreSQL, we expected to step on a lot of rakes during the migration. And there were none. There were plenty of problems, but they were not related to PostgreSQL, which surprised us. There were many problems with the data, because over more than 10 years a lot of legacy had accumulated. We suddenly discovered that in some databases the data was in KOI8-R encoding, and other strange things. Of course, there were also mistakes in the transfer logic, so data had to be fixed.

The end


There are things we still miss in PostgreSQL.

For example, partitioning, to move old data from SSD to SATA. We lack a good built-in recovery manager, so that we do not have to use a fork of barman; our changes will probably never make it into barman's core. We have been prodding them for almost a year and they are in no hurry. I think this should be on the PostgreSQL side, in the core itself.

The wait interface will keep developing. I think quorum commit will appear in version 10; there is a patch in good shape. We also really want proper work with the disk: in terms of disk I/O, PostgreSQL loses badly to Oracle.

What is the result? Counting RAID and the replicas, we now have more than 1 petabyte in PostgreSQL. I counted recently: a bit more than 500 billion rows, with 250 thousand queries per second flying in. Altogether it took us 3 calendar years, but we spent more than 10 man-years, so the effort of the whole team is quite substantial.

What did we get? Faster deployment, despite the fact that there are now many more databases, and fewer DBAs: there are now fewer DBAs on this project than there were with Oracle.

Whether we wanted to or not, we had to refactor all the backend code. All the legacy we had accumulated over the years has been cut out. Our code is now cleaner, and that is very good.

It does not happen without a fly in the ointment: we now have 3 times more hardware under PostgreSQL, but that is nothing compared with the cost of Oracle. So far we have had no major failures.

A small note from me. In Mail we use a lot of open source libraries, projects and ready-made solutions. To the three chairs we sit firmly on almost everywhere (Linux, nginx, postfix) we have added PostgreSQL. We now use it as the database in other projects too. We liked it. The fourth chair is a good, reliable one. I think this is a success story.

That is all I have. Thank you!


Vladimir Borodin: The success story of "Yandex.Mail" with PostgreSQL
Article based on information from habrahabr.ru
