PROFESSIONAL NoSQL Shashank Tiwari WILEY John Wiley & Sons, Inc.
CONTENTS

INTRODUCTION

CHAPTER 1: NOSQL: WHAT IT IS AND WHY YOU NEED IT
  Definition and Introduction
  Context and a Bit of History
  Big Data
  Scalability
  Definition and Introduction
  Sorted Ordered Column-Oriented Stores
  Key/Value Stores
  Document Databases
  Graph Databases
  Summary

CHAPTER 2: HELLO NOSQL: GETTING INITIAL HANDS-ON EXPERIENCE
  First Impressions: Examining Two Simple Examples
  A Simple Set of Persistent Preferences Data
  Storing Car Make and Model Data
  Working with Language Bindings
  MongoDB's Drivers
  A First Look at Thrift
  Summary

CHAPTER 3: INTERFACING AND INTERACTING WITH NOSQL
  If No SQL, Then What?
  Storing and Accessing Data
  Storing Data In and Accessing Data from MongoDB
  Querying MongoDB
  Storing Data In and Accessing Data from Redis
  Querying Redis
  Storing Data In and Accessing Data from HBase
  Querying HBase
  Storing Data In and Accessing Data from Apache Cassandra
  Querying Apache Cassandra
  Language Bindings for NoSQL Data Stores
  Being Agnostic with Thrift
  Language Bindings for Java
  Language Bindings for Python
  Language Bindings for Ruby
  Language Bindings for PHP
  Summary

CHAPTER 4: UNDERSTANDING THE STORAGE ARCHITECTURE
  Working with Column-Oriented Databases
  Using Tables and Columns in Relational Databases
  Contrasting Column Databases with RDBMS
  Column Databases as Nested Maps of Key/Value Pairs
  Laying out the Webtable
  HBase Distributed Storage Architecture
  Document Store Internals
  Storing Data in Memory-Mapped Files
  Guidelines for Using Collections and Indexes in MongoDB
  MongoDB Reliability and Durability
  Horizontal Scaling
  Understanding Key/Value Stores in Memcached and Redis
  Under the Hood of Memcached
  Redis Internals
  Eventually Consistent Non-relational Databases
  Consistent Hashing
  Object Versioning
  Gossip-Based Membership and Hinted Handoff
  Summary

CHAPTER 5: PERFORMING CRUD OPERATIONS
  Creating Records
  Creating Records in a Document-Centric Database
  Using the Create Operation in Column-Oriented Databases
  Using the Create Operation in Key/Value Maps
  Accessing Data
  Accessing Documents from MongoDB
  Accessing Data from HBase
  Querying Redis
  Updating and Deleting Data
  Updating and Modifying Data in MongoDB, HBase, and Redis
  Limited Atomicity and Transactional Integrity
  Summary

CHAPTER 6: QUERYING NOSQL STORES
  Similarities Between SQL and MongoDB Query Features
  Loading the MovieLens Data
  MapReduce in MongoDB
  Accessing Data from Column-Oriented Databases Like HBase
  The Historical Daily Market Data
  Querying Redis Data Stores
  Summary

CHAPTER 7: MODIFYING DATA STORES AND MANAGING EVOLUTION
  Changing Document Databases
  Schema-less Flexibility
  Exporting and Importing Data from and into MongoDB
  Schema Evolution in Column-Oriented Databases
  HBase Data Import and Export
  Data Evolution in Key/Value Stores
  Summary

CHAPTER 8: INDEXING AND ORDERING DATA SETS
  Essential Concepts Behind a Database Index
  Indexing and Ordering in MongoDB
  Creating and Using Indexes in MongoDB
  Compound and Embedded Keys
  Creating Unique and Sparse Indexes
  Keyword-based Search and MultiKeys
  Indexing and Ordering in CouchDB
  The B-tree Index in CouchDB
  Indexing in Apache Cassandra
  Summary

CHAPTER 9: MANAGING TRANSACTIONS AND DATA INTEGRITY
  RDBMS and ACID
  Isolation Levels and Isolation Strategies
  Distributed ACID Systems
  Consistency
  Availability
  Partition Tolerance
  Upholding CAP
  Compromising on Availability
  Compromising on Partition Tolerance
  Compromising on Consistency
  Consistency Implementations in a Few NoSQL Products
  Distributed Consistency in MongoDB
  Eventual Consistency in CouchDB
  Eventual Consistency in Apache Cassandra
  Consistency in Membase
  Summary

CHAPTER 10: USING NOSQL IN THE CLOUD
  Google App Engine Data Store
  GAE Python SDK: Installation, Setup, and Getting Started
  Essentials of Data Modeling for GAE in Python
  Queries and Indexes
  Allowed Filters and Result Ordering
  Tersely Exploring the Java App Engine SDK
  Amazon SimpleDB
  Getting Started with SimpleDB
  Using the REST API
  Accessing SimpleDB Using Java
  Using SimpleDB with Ruby and Python
  Summary

CHAPTER 11: SCALABLE PARALLEL PROCESSING WITH MAPREDUCE
  Understanding MapReduce
  Finding the Highest Stock Price for Each Stock
  Uploading Historical NYSE Market Data into CouchDB
  MapReduce with HBase
  MapReduce Possibilities and Apache Mahout
  Summary

CHAPTER 12: ANALYZING BIG DATA WITH HIVE
  Hive Basics
  Back to Movie Ratings
  Good Old SQL
  JOIN(s) in Hive QL
  Explain Plan
  Partitioned Table
  Summary

CHAPTER 13: SURVEYING DATABASE INTERNALS
  MongoDB Internals
  MongoDB Wire Protocol
  Inserting a Document
  Querying a Collection
  MongoDB Database Files
  Membase Architecture
  Hypertable Under the Hood
  Regular Expression Support
  Bloom Filter
  Apache Cassandra
  Peer-to-Peer Model
  Based on Gossip and Anti-entropy
  Fast Writes
  Hinted Handoff
  Berkeley DB
  Storage Configuration
  Summary

CHAPTER 14: CHOOSING AMONG NOSQL FLAVORS
  Comparing NoSQL Products
  Scalability
  Transactional Integrity and Consistency
  Data Modeling
  Querying Support
  Access and Interface Availability
  Benchmarking Performance
  50/50 Read and Update
  95/5 Read and Update
  Scans
  Scalability Test
  Hypertable Tests
  Contextual Comparison
  Summary

CHAPTER 15: COEXISTENCE
  Using MySQL as a NoSQL Solution
  Mostly Immutable Data Stores
  Polyglot Persistence at Facebook
  Data Warehousing and Business Intelligence
  Web Frameworks and NoSQL
  Using Rails with NoSQL
  Using Django with NoSQL
  Using Spring Data
  Migrating from RDBMS to NoSQL
  Summary

CHAPTER 16: PERFORMANCE TUNING
  Goals of Parallel Algorithms
  The Implications of Reducing Latency
  How to Increase Throughput
  Linear Scalability
  Influencing Equations
  Amdahl's Law
  Little's Law
  Message Cost Model
  Partitioning
  Scheduling in Heterogeneous Environments
  Additional Map-Reduce Tuning
  Communication Overheads
  Compression
  File Block Size
  Parallel Copying
  HBase Coprocessors
  Leveraging Bloom Filters
  Summary

CHAPTER 17: TOOLS AND UTILITIES
  RRDTool
  Nagios
  Scribe
  Flume
  Chukwa
  Pig
  Interfacing with Pig
  Pig Latin Basics
  Nodetool
  OpenTSDB
  Solandra
  Hummingbird and C5t
  GeoCouch
  Alchemy Database
  Webdis
  Summary

APPENDIX: INSTALLATION AND SETUP INSTRUCTIONS
  Installing and Setting Up Hadoop
  Installing Hadoop
  Configuring a Single-node Hadoop Setup
  Configuring a Pseudo-distributed Mode Setup
  Installing and Setting Up HBase
  Installing and Setting Up Hive
  Configuring Hive
  Overlaying Hadoop Configuration
  Installing and Setting Up Hypertable
  Making the Hypertable Distribution FHS-Compliant
  Configuring Hadoop with Hypertable
  Installing and Setting Up MongoDB
  Configuring MongoDB
  Installing and Configuring CouchDB
  Installing CouchDB from Source on Ubuntu 10.04
  Installing and Setting Up Redis
  Installing and Setting Up Cassandra
  Configuring Cassandra
  Configuring log4j for Cassandra
  Installing Cassandra from Source
  Installing and Setting Up Membase Server and Memcached
  Installing and Setting Up Nagios
  Downloading and Building Nagios
  Configuring Nagios
  Compiling and Installing Nagios Plugins
  Installing and Setting Up RRDtool
  Installing Handler Socket for MySQL

INDEX