Collations in MySQL 8.0

Bernt Marius Johnsen, Senior QA Engineer

Warning: This presentation uses Unicode graphemes, even for the ellipsis ('…', U+2026).

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Agenda

1. Why Unicode
2. What is a character set, a collation, etc.
3. What's new in MySQL 8.0
4. How to migrate, and some issues to consider

Why Unicode?

The whole world is moving towards Unicode as digital devices are used by more and more people across all cultures around the globe. (1)

One driving force is emojis: smileys, hearts, roses and all the other symbols people send each other when communicating these days.

Useful example: Unicode character U+1F574, MAN IN BUSINESS SUIT LEVITATING: 🕴

This is way more letters than just ASCII!

(1) Approximate number of users, in billions, of the most used writing systems: Latin: ~5, Chinese: ~1.5, Arabic: ~0.7, Devanagari: ~0.5, Cyrillic: ~0.25, Bengali: ~0.22, Kana: ~0.12

Why Unicode in a database?

- You may use one character set for all your data, for all purposes.
- E.g. if you build an application that uses utf8mb4 for a column with names, the column may hold Russian names, Chinese names, Japanese names, etc.
- Even esoteric extinct writing systems are covered, such as the Phaistos Disc (look it up...).
- But not Klingon, nor Tengwar.

What is Unicode?

"Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems." (Wikipedia)

- Corresponding ISO standard: ISO/IEC 10646.
- Unicode covers most existing and extinct writing systems known to man in one standard.
- The standard has allocated 17 planes, and blocks of characters are allocated into the planes.
- Six planes assigned so far:
  - Plane 0: U+0000 - U+FFFF: Basic Multilingual Plane (BMP)
  - Plane 1: U+10000 - U+1FFFF: Supplementary Multilingual Plane (SMP)
  - Plane 2: U+20000 - U+2FFFF: Supplementary Ideographic Plane (SIP)
  - Plane 14: U+E0000 - U+EFFFF: Supplementary Special-Purpose Plane (SSP)
  - Planes 15 & 16: U+F0000 - U+10FFFF: Supplementary Private Use Area A and B (PUA-A and PUA-B)
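Because each plane spans 0x10000 code points, the plane of a character follows directly from its code point. The following is a minimal Python sketch of that arithmetic; the names `plane_of` and `PLANE_NAMES` are made up for this illustration and are not part of MySQL or any library.

```python
# Sketch: the plane number is the code point divided by 0x10000 (a 16-bit shift).
# PLANE_NAMES lists only the six planes named on the slide above.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-Purpose Plane (SSP)",
    15: "Supplementary Private Use Area A (PUA-A)",
    16: "Supplementary Private Use Area B (PUA-B)",
}


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(ch) >> 16


if __name__ == "__main__":
    for ch in ("A", "人", "\U0001F574"):
        p = plane_of(ch)
        print(f"U+{ord(ch):04X} is in plane {p}: {PLANE_NAMES.get(p, 'not listed')}")
```

The emoji U+1F574 lands in plane 1, which is why it needs four bytes in UTF-8 and a surrogate pair in UTF-16.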

What is a CHARACTER SET?

A character set is defined by:
- A repertoire of characters/graphemes
- A value given to each character/grapheme (code point)
- An encoding which defines the binary representation of the values

What is Encoding?

The binary representation of a character/grapheme.

- The simplest ones are 1:1. A character is a byte and a byte is a character (ASCII, ISO-8859-1/Latin-1, etc.)
- Unicode defines 3 encodings:
  - UTF-8 (1-4 bytes per character)
  - UTF-16 (2 or 4 bytes per character)
  - UTF-32 (4 bytes per character)
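The byte counts quoted above can be checked with any language that exposes the Unicode encodings. Here is a small Python sketch; `encoded_lengths` is a hypothetical helper name, and the big-endian codec variants are used only so that no byte order mark is counted.

```python
# Sketch: byte length of one character in each of the three Unicode encodings.
# The "-be" codecs are chosen so that Python does not prepend a byte order mark.
def encoded_lengths(ch: str) -> dict:
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    for ch in ("A", "Ä", "人", "\U0001F574"):
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```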

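Character set examples

+-----------+----------------------+----------+----------+------------+
| Character | Character set        | Value    | Encoding | Encoded as |
+-----------+----------------------+----------+----------+------------+
| A         | ASCII                | 41       | 1:1      | 41         |
| A         | ISO-8859-1 (Latin-1) | 41       | 1:1      | 41         |
| A         | Unicode              | U+0041   | UTF-8    | 41         |
| A         | Unicode              | U+0041   | UTF-16   | 0041       |
| Ä         | ISO-8859-1 (Latin-1) | C4       | 1:1      | C4         |
| Ä         | Unicode              | U+00C4   | UTF-8    | C384       |
| Ä         | Unicode              | U+00C4   | UTF-16   | 00C4       |
| д         | KOI8-R               | C4       | 1:1      | C4         |
| д         | ISO-8859-5           | D4       | 1:1      | D4         |
| д         | Unicode              | U+0434   | UTF-8    | D0B4       |
| д         | Unicode              | U+0434   | UTF-16   | 0434       |
| 人        | GB-18030             | C8CB     | 1:1      | C8CB       |
| 人        | Unicode              | U+4EBA   | UTF-8    | E4BABA     |
| 人        | Unicode              | U+4EBA   | UTF-16   | 4EBA       |
| 人        | Big5                 | A448     | 1:1      | A448       |
| 人        | JIS X 0208 (SJIS)    | 906C     | 1:1      | 906C       |
| 🕴        | Unicode              | U+1F574  | UTF-8    | F09F95B4   |
| 🕴        | Unicode              | U+1F574  | UTF-16   | D83DDD74   |
| 🕴        | GB-18030             | 9439EE36 | 1:1      | 9439EE36   |
+-----------+----------------------+----------+----------+------------+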
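The "Encoded as" values can be reproduced outside MySQL. The sketch below uses Python's built-in codecs as stand-ins for the character sets named on the slide (for example `shift_jis` for SJIS and `latin-1` for ISO-8859-1); that mapping is an assumption of this example, and `hex_bytes` and `ROWS` are hypothetical names.

```python
# Sketch: print the encoded bytes of a character under several codecs.
# Each pair is (character, Python codec name standing in for the slide's character set).
ROWS = [
    ("A", "ascii"),
    ("Ä", "latin-1"),
    ("Ä", "utf-8"),
    ("д", "koi8-r"),
    ("д", "iso-8859-5"),
    ("д", "utf-8"),
    ("人", "gb18030"),
    ("人", "big5"),
    ("人", "shift_jis"),
    ("人", "utf-16-be"),
    ("\U0001F574", "utf-8"),
    ("\U0001F574", "utf-16-be"),
    ("\U0001F574", "gb18030"),
]


def hex_bytes(ch: str, codec: str) -> str:
    """Encode one character with the given codec and return the bytes as upper-case hex."""
    return ch.encode(codec).hex().upper()


if __name__ == "__main__":
    for ch, codec in ROWS:
        print(f"U+{ord(ch):04X}  {codec:<11} {hex_bytes(ch, codec)}")
```

The same letter д ends up as C4 or D4 depending on the 8-bit character set, which is exactly the ambiguity a single Unicode character set removes.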

What is collation?

"Collation is the assembly of written information into a standard order." (Wikipedia)

Collation may consider:
- Case (e.g. 'A' vs. 'a')
- Accents (e.g. 'E' vs. 'É')
- Locale-specific rules (e.g. 'A' vs. 'Å' vs. 'AA' in Danish and Norwegian)
- Numeric characters (e.g. '2' vs. 'ⅱ')
- Punctuation (e.g. 'blackbird' vs. 'black-bird')
- Etc.

What is a COLLATION in (My)SQL?

In (My)SQL, a COLLATION is a set of rules for a given character set which defines an order and affects:
- ORDER BY
- LIKE
- Primary keys and indexes
- Unique constraints
- Comparison operators
- Some string functions

All strings in MySQL have a character set and a collation.

Character sets in MySQL

    +----------+---------------------------------+---------------------+--------+
    | Charset  | Description                     | Default collation   | Maxlen |
    +----------+---------------------------------+---------------------+--------+
    | ascii    | US ASCII                        | ascii_general_ci    |      1 |
    | latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
    | utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
    | utf8mb4  | UTF-8 Unicode                   | utf8mb4_0900_ai_ci  |      4 |
    +----------+---------------------------------+---------------------+--------+

Get all of them by typing:

    mysql> show character set;

The rest of them are: armscii8, big5, binary, cp1250, cp1251, cp1256, cp1257, cp850, cp852, cp866, cp932, dec8, eucjpms, euckr, gb18030, gb2312, gbk, geostd8, greek, hebrew, hp8, keybcs2, koi8r, koi8u, latin2, latin5, latin7, macce, macroman, sjis, swe7, tis620, ucs2, ujis, utf16, utf16le, utf32

New in MySQL 8.0

- Default character set: utf8mb4, with default collation utf8mb4_0900_ai_ci.
- Three language-independent collations: utf8mb4_0900_ai_ci, utf8mb4_0900_as_ci and utf8mb4_0900_as_cs. They may be used for German dictionary order, English, French (1), Irish Gaelic, Indonesian, Italian, Luxembourgian, Malay, Dutch, Portuguese, Swahili and Zulu.
- A lot of new collations based on:
  - Unicode v. 9.0.0
  - UCA (Unicode Collation Algorithm)
  - DUCET (Default Unicode Collation Element Table)
  - CLDR v. 30 (Common Locale Data Repository)
- All utf8mb4_*_0900_* collations are NO PAD.

(1) Canadian French may not use the utf8mb4_0900_as_cs/utf8mb4_0900_as_ci collations, because its accent order differs from the standard one.
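The NO PAD attribute is easiest to see next to its counterpart: the older MySQL collations are PAD SPACE, which means trailing spaces are not significant when nonbinary strings are compared. The Python sketch below models only that one attribute and deliberately ignores case and accent rules; `equal_pad_space` and `equal_no_pad` are hypothetical names, not MySQL functions.

```python
# Sketch: how the pad attribute alone changes string equality.
# This models trailing-space handling only, not case or accent sensitivity.
def equal_pad_space(a: str, b: str) -> bool:
    """PAD SPACE semantics: trailing spaces are ignored when comparing."""
    return a.rstrip(" ") == b.rstrip(" ")


def equal_no_pad(a: str, b: str) -> bool:
    """NO PAD semantics: a trailing space counts like any other character."""
    return a == b


if __name__ == "__main__":
    print(equal_pad_space("abc", "abc  "))  # equal under PAD SPACE
    print(equal_no_pad("abc", "abc  "))     # different under NO PAD
```

One practical consequence is that two values differing only in trailing spaces may coexist in a unique index under a NO PAD collation, while a PAD SPACE collation treats them as duplicates.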

New in MySQL 8.0

We have gone to great lengths to make the new utf8mb4_*_0900_* collations correct and complete.

- Accent insensitive/case insensitive (ai_ci) and accent sensitive/case sensitive (as_cs) utf8mb4 collations have been implemented for: Classical Latin (la), Croatian (hr), Czech (cs), Danish/Norwegian (da), Esperanto (eo), Estonian (et), German phone book order (de_pb), Hungarian (hu), Icelandic (is), Latvian (lv), Lithuanian (lt), Polish (pl), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Modern Spanish (es), Traditional Spanish (es_trad), Swedish (sv), Turkish (tr), Vietnamese (vi)
- Accent/case sensitive (as_cs) and accent/case/kana sensitive (as_cs_ks) utf8mb4 collations have been implemented for: Japanese (ja)

MySQL 8.0 collation name scheme

    <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+

- <charset> = utf8mb4
- <language>: an ISO 639-1 language code (or ISO 639-2 if needed)
- <variant>: a variant of the standard collation for the language. As of today: utf8mb4_de_pb_0900_* and utf8mb4_es_trad_0900_*.
- <unicodeversion> = 0900
- <attribute>: accent sensitivity (ai, as), case sensitivity (ci, cs), kana sensitivity (ks) and possible future ones.
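The naming scheme is regular enough to be taken apart mechanically. The following Python sketch translates the pattern above into a regular expression; the function name `parse_collation` and the restriction of attributes to ai/as/ci/cs/ks are assumptions of this example, so collations with attributes added later would not match.

```python
import re

# Sketch: a regular expression mirroring
#   <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+
# Only the attributes named on the slide (ai, as, ci, cs, ks) are accepted.
COLLATION_RE = re.compile(
    r"(?P<charset>utf8mb4)"
    r"(?:_(?P<language>[a-z]{2,3})(?:_(?P<variant>[a-z]+))?)?"
    r"_(?P<unicodeversion>\d{4})"
    r"(?P<attributes>(?:_(?:ai|as|ci|cs|ks))+)"
)


def parse_collation(name: str):
    """Split a new-style collation name into its parts, or return None if it does not fit."""
    m = COLLATION_RE.fullmatch(name)
    if m is None:
        return None
    parts = m.groupdict()
    # "_as_cs_ks" becomes ["as", "cs", "ks"]
    parts["attributes"] = parts["attributes"].strip("_").split("_")
    return parts


if __name__ == "__main__":
    for name in ("utf8mb4_0900_ai_ci", "utf8mb4_de_pb_0900_as_cs",
                 "utf8mb4_ja_0900_as_cs_ks", "utf8mb4_general_ci"):
        print(name, "->", parse_collation(name))
```

Old-style names such as utf8mb4_general_ci carry no Unicode version, so they fall outside the scheme and the sketch returns None for them.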

Why not...

Fix utf8mb4_general_ci instead of introducing utf8mb4_0900_ai_ci, or fix utf8mb4_german2_ci instead of introducing utf8mb4_de_pb_0900_ai_ci?
- Because that might break existing applications using the old collations. (The most serious issue for large databases: indexes would have to be rebuilt.)
- Our policy: collations don't change!

Have a simpler name scheme?
- Because we prepare for more languages and for new Unicode versions (Unicode 10.0.0 is expected in 2018).
- ISO 639-1/ISO 639-2 language codes are well defined.

How to migrate?

When migrating from 5.7 tables, just convert the table:

    ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4;

This will change the default character set of the table (so that future new columns get utf8mb4) and the character set of all applicable columns.

In theory, all character data in MySQL may be converted to utf8mb4 without loss of data.

That was easy... is that all there is to it...?

Upgrading to MySQL 8.0

When upgrading to 8.0:
- Schemas (databases) keep their default character set/collation.
- Tables keep their default character set/collation.
- Columns keep their character set/collation.

To take advantage of utf8mb4, you need to migrate.

Not quite: column by column

If you have more complex tables with different character sets:

- Change the default character set of the table:

      ALTER TABLE foo DEFAULT CHARACTER SET utf8mb4;

- Modify all relevant columns:

      ALTER TABLE foo MODIFY bar VARCHAR(100) CHARACTER SET utf8mb4;

Generally we recommend doing it column by column. ALTER TABLE CONVERT will, for example, change TEXT to MEDIUMTEXT when you convert from latin1 to utf8mb4, and that won't necessarily be what you want.

Not quite: the schema too

A schema (aka database) in MySQL has a default character set, which will be the default character set of new tables in the schema.

    mysql> show create schema bar;
    +----------+----------------------------------------------------------------+
    | Database | Create Database                                                |
    +----------+----------------------------------------------------------------+
    | bar      | CREATE DATABASE `bar` /*!40100 DEFAULT CHARACTER SET latin1 */ |
    +----------+----------------------------------------------------------------+
    1 row in set (0.00 sec)

Change the default character set of the schema (database):

    ALTER SCHEMA bar DEFAULT CHARACTER SET utf8mb4;

Not quite: collation differences

Collations are not equal, so converting from one collation to another may break UNIQUE constraints (e.g. PRIMARY KEY).

Default collation: latin1_swedish_ci vs. utf8mb4_0900_ai_ci. E.g. 'o'='ö' is false in the first, but true in the other.

Possible solution: stick to Swedish or another suitable collation, depending on your application:

    ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_sv_0900_ai_ci;

Generally, if you don't care about case insensitivity (you just got it by default), utf8mb4_0900_as_cs should be safe.

There is a huge number of possibilities depending on your data and the collations used, partly because pre-MySQL 8.0 collations were not complete (and in some cases not correct).
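Before converting a column with a UNIQUE constraint, it can help to look for values that an accent- and case-insensitive collation would treat as equal. The Python sketch below uses a deliberately rough approximation (decompose, drop combining marks, case-fold); it is not the Unicode Collation Algorithm and will disagree with utf8mb4_0900_ai_ci on some characters, so the server's own comparison remains authoritative. The names `fold_ai_ci` and `find_collisions` are hypothetical, and the sample city names are the ones used in the upgrade example of this deck.

```python
import unicodedata
from collections import defaultdict


def fold_ai_ci(s: str) -> str:
    """Rough stand-in for an accent/case-insensitive key (NOT the real UCA weights)."""
    decomposed = unicodedata.normalize("NFD", s)            # 'Ö' becomes 'O' + U+0308
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def find_collisions(values):
    """Group values whose folded keys are equal; return only groups with more than one member."""
    groups = defaultdict(list)
    for v in values:
        groups[fold_ai_ci(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    cities = ["København", "Orebro", "Oslo", "Stockholm", "Örebro"]
    print(find_collisions(cities))
```

For the sample list, the sketch reports 'Orebro' and 'Örebro' as a colliding pair, which is the kind of conflict that surfaces as a duplicate-key error during conversion.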

Not quite: index and key issues

- If you change the collation of a column, indexes on that column will be regenerated. This takes time for large data, and the table is locked during that time.
- The conversion may also fail due to changed space consumption. Max key length is 3072 bytes (1), which implies that the maximum length of a utf8mb4 VARCHAR column that is also a key is 768 characters (worst-case scenario: 4 bytes per character).

    mysql> create table foo (v varchar(1000) character set latin1 primary key);
    Query OK, 0 rows affected (0.01 sec)

    mysql> alter table foo modify v varchar(1000) character set utf8mb4;
    ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes

(1) For the default InnoDB row format and default innodb_page_size in MySQL 8.0. See the documentation for details.
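The 768-character figure is plain arithmetic on the limit and the per-character maximum. A small Python sketch of that calculation follows; the Maxlen values are the ones listed in the character set table of this deck, the 3072-byte limit is the one quoted above for the default InnoDB settings, and the function names are hypothetical.

```python
# Sketch: worst-case key size check for an indexed VARCHAR column.
MAX_KEY_BYTES = 3072  # limit quoted on the slide (default InnoDB row format and page size)

# Maximum bytes per character, as reported in the Maxlen column of SHOW CHARACTER SET.
MAXLEN = {"ascii": 1, "latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset: str, max_key_bytes: int = MAX_KEY_BYTES) -> int:
    """Longest VARCHAR length (in characters) that still fits in one key."""
    return max_key_bytes // MAXLEN[charset]


def fits_in_key(n_chars: int, charset: str, max_key_bytes: int = MAX_KEY_BYTES) -> bool:
    """True if a VARCHAR(n_chars) key fits in the worst case."""
    return n_chars * MAXLEN[charset] <= max_key_bytes


if __name__ == "__main__":
    print(max_indexable_chars("utf8mb4"))   # 3072 // 4
    print(fits_in_key(1000, "latin1"))      # 1000 bytes, fits
    print(fits_in_key(1000, "utf8mb4"))     # 4000 bytes, too long
```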

Upgrade example

    mysql> show create table cities;
    +--------+---------------------
    | Table  | Create Table
    +--------+---------------------
    | cities | CREATE TABLE `cities` (
      `name` varchar(1024) NOT NULL,
      `population` int(11) DEFAULT NULL,
      PRIMARY KEY (`name`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
    +--------+----------------------
    1 row in set (0.00 sec)

    mysql> select * from cities;
    +------------+------------+
    | name       | population |
    +------------+------------+
    | København  |    1246611 |
    | Orebro     |     107380 |
    | Oslo       |     666759 |
    | Stockholm  |     935619 |
    | Örebro     |     107380 |
    +------------+------------+
    5 rows in set (0.00 sec)

    mysql> alter table cities modify column name varchar(1024) charset utf8mb4;
    ERROR 1062 (23000): Duplicate entry 'Örebro' for key 'PRIMARY'

Upgrade example contd.

The conversion below reports 4 rows and the result no longer contains 'Orebro', so the duplicate row was removed first; the column is also shortened to 768 characters.

    mysql> alter table cities modify column name varchar(768) charset utf8mb4;
    Query OK, 4 rows affected (0.01 sec)
    Records: 4 Duplicates: 0 Warnings: 0

    mysql> insert into cities values('東京',13617445);
    Query OK, 1 row affected (0.00 sec)

    mysql> select * from cities;
    +------------+------------+
    | name       | population |
    +------------+------------+
    | København  |    1246611 |
    | Oslo       |     666759 |
    | Örebro     |     107380 |
    | Stockholm  |     935619 |
    | 東京       |   13617445 |
    +------------+------------+
    6 rows in set (0.00 sec)

文字化け (Mojibake)

(or: what you see is not what you get...)

    mysql> create table foo(v varchar(10) character set latin1);
    mysql> insert into foo values('å');
    mysql> set names latin1;
    mysql> insert into foo values('å');
    mysql> set names utf8mb4;

    mysql> select * from foo;
    +------+
    | v    |
    +------+
    | å    |
    | Ã¥   |
    +------+
    2 rows in set (0.00 sec)

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | E5     |
    | C3A5   |
    +--------+
    2 rows in set (0.00 sec)

Fixing Ã¥

    mysql> select v from foo;
    +-------------------------------+
    | v                             |
    +-------------------------------+
    | Ã¥                            |
    +-------------------------------+
    1 row in set (0.01 sec)

    mysql> update foo set v=convert(convert(convert(v using binary) using utf8mb4) using latin1);
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1 Changed: 1 Warnings: 0

    mysql> select v from foo;
    +--------------+
    | v            |
    +--------------+
    | å            |
    +--------------+
    1 row in set (0.00 sec)

Fixing æ å åœã

    mysql> select v from foo;
    +-------------------------------+
    | v                             |
    +-------------------------------+
    | æ å åœã                       |
    +-------------------------------+
    1 row in set (0.01 sec)

    mysql> alter table foo modify column v varchar(128) charset binary;
    Query OK, 1 row affected (0.14 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> alter table foo modify column v varchar(128) charset utf8mb4;
    Query OK, 1 row affected (0.14 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> select v from foo;
    +--------------+
    | v            |
    +--------------+
    | 文字化け     |
    +--------------+
    1 row in set (0.00 sec)
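Both fixes above rest on the same idea: the stored bytes are valid UTF-8 that has been labelled as latin1, so the repair is to get back to the raw bytes and decode them as UTF-8. The Python sketch below replays that round trip outside the database. Note the assumption: Python's `latin-1` codec is ISO-8859-1, whereas MySQL's latin1 is cp1252, so the garbled glyphs can look different for bytes 0x80 to 0x9F even though the byte-level round trip is the same. The function names are hypothetical.

```python
# Sketch: how mojibake arises and how reinterpreting the bytes repairs it.
def make_mojibake(text: str) -> str:
    """UTF-8 bytes wrongly interpreted as Latin-1, one character per byte."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Recover the raw bytes, then decode them as the UTF-8 they really are."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    garbled = make_mojibake("å")
    print(garbled, garbled.encode("latin-1").hex().upper())  # two characters, bytes C3A5
    print(repair_mojibake(garbled))
    print(repair_mojibake(make_mojibake("文字化け")))
```

The repair only works while the original bytes are intact; if the garbled text has been converted again in the meantime, information may already be lost.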

Space consumption

utf8mb4 uses:
- 1 byte for ASCII characters (U+0000 - U+007F),
- 2 bytes for most alphabets/abjads (U+0080 - U+07FF),
- 3 bytes for Indic scripts, Hangul, Kana and the most used CJK ideographs (U+0800 - U+FFFF),
- 4 bytes for the rest: archaic scripts, emojis, rarely used CJK extensions, variant selectors, etc. (U+10000 -)
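These ranges make it possible to estimate how much a latin1 column grows when converted. The Python sketch below applies the four ranges directly; `utf8mb4_bytes` and `storage_bytes` are hypothetical helper names, and the comparison with Python's `latin-1` codec assumes the text contains only characters that ISO-8859-1 can represent.

```python
# Sketch: bytes needed per character in utf8mb4, following the ranges on the slide.
def utf8mb4_bytes(ch: str) -> int:
    """Number of bytes one character occupies in utf8mb4 (i.e. UTF-8)."""
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def storage_bytes(text: str) -> int:
    """Total utf8mb4 bytes for a string (character data only, no length prefix)."""
    return sum(utf8mb4_bytes(c) for c in text)


if __name__ == "__main__":
    name = "København"
    print(len(name.encode("latin-1")), "bytes as latin1")
    print(storage_bytes(name), "bytes as utf8mb4")
```

For mostly-ASCII Western European text the growth is small, since only the accented letters move from one byte to two.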

Speed issues

- Operations on multibyte character sets are inherently slower than on single-byte character sets (compare utf8mb4 with latin1).
- We have done a lot of code improvements:
  - New code for the new utf8mb4 collations
  - New collations are NO PAD (which gives faster algorithms)
- But expect a performance degradation in the order of 10-20% for sorting when you migrate from e.g. latin1 to utf8mb4, depending on your data of course.
- Some collations are inherently slower than others (e.g. utf8mb4_0900_ai_ci vs. utf8mb4_ja_0900_as_cs_ks).

Truly usable for global purposes...

Q&A

- Check out my blog posts at http://mysqlserverteam.com/author/bernt/
- The 8.0 documentation (if everything else fails): https://dev.mysql.com/doc/refman/8.0/en/charset.html
- The Unicode documents (for those truly interested): http://unicode.org/

😴 (U+1F634)