
Cross Data Center Replication (CDCR)

Introduction

The goal of the project is to replicate data to multiple Data Centers. The initial version of the solution covers the active-passive scenario, where data updates are replicated from a source Data Center to a target Data Center. Data updates include adding, updating, and deleting documents. Three or more Data Centers can be configured in a daisy chain, where a target Data Center is configured to be the source Data Center of a third Data Center.

Data changes on the source Data Center are replicated to the target Data Center only after they are persisted to disk. The data changes can be replicated in near real-time (with a small delay) or scheduled in intervals to the target Data Center. This solution presupposes that the source and target Data Centers begin with the same documents indexed; of course, the indexes may be empty to start.

Each shard leader in the source Data Center is responsible for replicating its updates to the appropriate shard leader in the target Data Center. When receiving updates from the source Data Center, shard leaders in the target Data Center replicate the changes to their own replicas. This replication model is designed to tolerate some degradation in connectivity, accommodate limited bandwidth, and support batch updates to optimize communication.

Replication supports both a new empty index and pre-built indexes. In the scenario where replication is set up on a pre-built index, CDCR ensures consistency of the replication of the updates, but cannot ensure consistency of the full index. Therefore, any index created before CDCR was set up must be replicated by other means (see "Starting CDCR the first time with an existing index" below) in order for the source and target indexes to be fully consistent.

The active-passive nature of the initial implementation implies a "push" model from the Source collection to the Target collection. Therefore, the Source configuration must be able to "see" the Zookeeper ensemble in the Target cluster.

The Zookeeper ensemble is provided as a configuration parameter in the Source configuration.

CDCR is configured to replicate from collections in the Source cluster to collections in the Target cluster on a collection-by-collection basis. Since CDCR is configured in solrconfig.xml (on both Source and Target clusters), the settings can be tailored to the needs of each collection.

Important: The Source and Target clusters must have the same number of shards and use the same document routing scheme ("compositeId" or "implicit").

CDCR can also be configured to replicate from one collection to a second collection within the same cluster. That is a specialized scenario not covered in this document.

Glossary

Terms used in this document include:

Node: A JVM instance running Solr; a server.
Cluster: A set of Solr nodes managed as a single unit.
Data Center: A group of networked servers hosting a Solr cluster. In this document, the terms Cluster and Data Center are interchangeable, as we assume that each Solr cluster is hosted in a different group of networked servers.
Shard: In Solr, a logical section of a single collection. It may be spread across multiple nodes of the cluster. Each shard can have as many replicas as needed.
Leader: Each shard has one node identified as its leader. All writes for documents belonging to a shard are routed through the leader.
Replica: A copy of a shard or single logical index, for use in failover or load balancing. The replicas comprising a shard can be either leaders or non-leaders.
Follower: A convenience term for a replica that is not the leader of a shard.
Collection: Multiple documents that make up one logical index. A cluster can have multiple collections.
Updates Log: An append-only log of write operations maintained by each node.
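Because the Source and Target clusters must expose collections with the same name, shard count, and router, the target collection has to exist before CDCR starts (see CDCR Configuration below). A minimal sketch of creating matching collections through the Collections API follows; the node URLs, collection name, configset name, and shard/replica counts are placeholder assumptions, not values from this document.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateMatchingCollections {
    // Hypothetical endpoints: any Solr node in the source and target clusters.
    static final String SOURCE_SOLR = "http://source-node:8983/solr";
    static final String TARGET_SOLR = "http://target-node:8983/solr";

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Same collection name, shard count, and router on both clusters, as CDCR requires.
        for (String baseUrl : new String[] {SOURCE_SOLR, TARGET_SOLR}) {
            String url = baseUrl + "/admin/collections?action=CREATE"
                    + "&name=collection1"
                    + "&numShards=2"
                    + "&replicationFactor=2"
                    + "&router.name=compositeId"
                    + "&collection.configName=collection1";
            HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(baseUrl + " -> HTTP " + resp.statusCode());
        }
    }
}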

Architecture

Data flow

Updates and deletes are first written to the source cluster, then forwarded to the target cluster. The data flow sequence is:

1. A shard leader receives a new data update that is processed by its Update Processor.
2. The data update is first applied to the local index.

3. Upon successful application of the data update to the local index, the data update is added to the Updates Log queue.
4. After the data update is persisted to disk, the data update is sent to the replicas within the Data Center.
5. After Step 4 is successful, CDCR reads the data update from the Updates Log and pushes it to the corresponding leader on the Target Data Center. This is necessary in order to ensure consistency between the source and target Data Centers.
6. The leader on the target Data Center writes the data locally and forwards it to all its followers.

Steps 1, 2, 3, and 4 are performed synchronously by SolrCloud; Step 5 is performed asynchronously by a background thread. Given that CDCR replication is performed asynchronously, it becomes possible to push batch updates in order to minimize network communication. Also, if CDCR is unable to push an update at a given time -- for example, due to a degradation in connectivity -- it can retry later without any impact on the source Data Center.

One implication of the architecture is that the leaders in the Source cluster must be able to "see" the leaders in the Target cluster. Since leaders may change, this effectively means that all nodes in the Source cluster must be able to "see" all Solr nodes in the Target cluster, so firewalls, ACL rules, etc. must be configured with care.

Major components

There are a number of key features and components in CDCR's architecture:

CDCR Configuration

In order to configure CDCR, the source Data Center requires the host address of the Zookeeper cluster associated with the target Data Center. The Zookeeper host address is the only information needed by CDCR to instantiate communication with the target Solr cluster. The CDCR configuration file on the source cluster will therefore contain a list of Zookeeper hosts.
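CDCR reaches the target cluster through a ZooKeeper-aware SolrJ client, as described under Inter-Data Center Communication below. Purely as an illustration of why the target ZooKeeper address is the only required piece of information, here is a minimal SolrJ sketch; the ZooKeeper address and collection name are placeholders, and the builder shown is the SolrJ 7.x/8.x style, which differs in older versions.

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TargetClusterClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper address of the target cluster; with CDCR this
        // corresponds to the zkHost value in the <replica> configuration.
        String targetZk = "target-zk1:2181";

        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(Collections.singletonList(targetZk), Optional.empty())
                     .build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");
            client.add("collection1", doc);   // SolrJ routes this to the correct shard leader
            client.commit("collection1");
        }
    }
}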

The CDCR configuration file might also contain secondary/optional configuration, such as the number of CDC Replicator threads, batch update settings, etc.

CDCR will by default try to replicate all existing collections from the source Solr cluster to the target cluster, with the following requirements:

- The collections on the source and target clusters must both be configured with the same number of shards. Replication of collections where the number of shards differs across clusters is not supported.
- The names of the collections must be identical on both clusters. If a collection does not exist on the target cluster, it will not be replicated. It is therefore required to manually create collections on the target cluster before starting CDCR.

CDCR Initialization

In a bulk load scenario, where the index needs to be fully rebuilt, the push-based replication will be suboptimal, as the Updates Log on the source Data Center will likely grow quickly because the ingestion speed of the source Data Center will be much higher than the ingestion speed of the target Data Center. In this case, it is preferable to first load the index separately on the source and target Data Centers, and then initialize CDCR to start replicating updates between the two pre-built indexes. As a result, one requirement is to be able to initialize the replication both on a new empty index and on a pre-built index. In the scenario where replication is set up on a pre-built index, CDCR will ensure consistency of the replication of the updates, but cannot ensure consistency of the full index. Consistency of the full index will depend on the procedure used to pre-build the index.

Inter-Data Center Communication

Communication between Data Centers occurs exclusively between shard leaders and is one-directional (from source to target) as explained earlier. To initiate communication, the shard leader on the source Data Center will have to first discover the related shard leader on the target Data Center.

This discovery process is performed by contacting the Zookeeper cluster associated with the target Data Center. Communication between Data Centers is achieved through HTTP and the Solr REST API using the SolrJ client. The SolrJ client is instantiated with the Zookeeper host of the target Data Center, and SolrJ manages the shard leader discovery process.

Updates Tracking & Pushing

CDCR replicates data updates from the source to the target Data Center by leveraging the Updates Log. A background thread regularly checks the Updates Log for new entries and then forwards them to the target Data Center. The thread therefore needs to keep a checkpoint, in the form of a pointer to the last update successfully processed in the Updates Log. Upon acknowledgement from the target Data Center that updates have been successfully processed, the Updates Log pointer is updated to reflect the current checkpoint.

This pointer must be synchronized across all the replicas. In the case where the leader goes down and a new leader is elected, the new leader will be able to resume replication from the last update by using this synchronized pointer. The strategy to synchronize such a pointer across replicas is explained next.

If, for some reason, the target Data Center is offline or fails to process the updates, the thread will periodically try to contact the target Data Center and push the updates.
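The following is a schematic sketch of the checkpoint-based push loop described above, not Solr's actual CDC Replicator code; UpdatesLog, TargetLeaderClient, and Update are hypothetical types introduced only for illustration.

import java.util.List;

/** Schematic sketch of the replicator loop; not Solr's implementation. */
public class ReplicatorLoopSketch {
    interface Update { long version(); }
    interface UpdatesLog { List<Update> readAfter(long checkpoint, int maxBatch); }
    interface TargetLeaderClient { boolean push(List<Update> batch); } // true when the target acks

    private volatile long checkpoint = -1;   // last version acknowledged by the target

    void runOnce(UpdatesLog log, TargetLeaderClient target, int batchSize) {
        List<Update> batch = log.readAfter(checkpoint, batchSize);
        if (batch.isEmpty()) {
            return;                           // nothing new to replicate this tick
        }
        if (target.push(batch)) {
            // Advance the pointer only after the target acknowledges the batch,
            // so a failed push is simply retried on the next schedule tick.
            checkpoint = batch.get(batch.size() - 1).version();
        }
    }
}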

Synchronization of Update Checkpoints

A reliable synchronization of the update checkpoints between the shard leader and shard replicas is critical to avoid introducing inconsistency between the source and target Data Centers. Another important requirement is that the synchronization must be performed with minimal network traffic to maximize scalability. In order to achieve this, the strategy is to:

- Uniquely identify each update operation. This unique identifier will serve as the pointer.
- Rely on two storages: an ephemeral storage on the source shard leader, and a persistent storage on the target cluster.

The shard leader in the source cluster is in charge of generating a unique identifier for each update operation, and keeps a copy of the identifier of the last processed update in memory. The identifier is sent to the target cluster as part of the update request. On the target Data Center side, the shard leader receives the update request, stores it along with the unique identifier in the Updates Log, and replicates it to the other shards.

SolrCloud already provides a unique identifier for each update operation: a version number. This version number is generated using a time-based Lamport clock that is incremented for each update operation sent. This provides a happened-before ordering of the update operations that is leveraged in (1) the initialization of the update checkpoint on the source cluster, and (2) the maintenance strategy of the Updates Log.

The persistent storage on the target cluster is used only during the election of a new shard leader on the source cluster. If a shard leader goes down on the source cluster and a new leader is elected, the new leader contacts the target cluster to retrieve the last update checkpoint and instantiate its ephemeral pointer. On such a request, the target cluster retrieves the latest identifier received across all the shards and sends it back to the source cluster. To retrieve the latest identifier, every shard leader looks up the identifier of the first entry in its Updates Log and sends it back to a coordinator, which selects the highest among them. This strategy does not require any additional network traffic and ensures reliable pointer synchronization.

Consistency is principally achieved by leveraging SolrCloud. The update workflow of SolrCloud ensures that every update is applied to the leader and also to any of the replicas. If the leader goes down, a new leader is elected. During the leader election, a synchronization is performed between the new leader and the other replicas. As a result, the new leader has an Updates Log consistent with the previous leader. Having a consistent Updates Log means that:

- On the source cluster, the update checkpoint can be reused by the new leader.

- On the target cluster, the update checkpoint will be consistent between the previous and new leader. This ensures the correctness of the update checkpoint sent by a newly elected leader from the target cluster.

Impact of Solr's Update Reordering

The Updates Log can differ between the leader and the replicas, but not in an inconsistent way. During leader-to-replica synchronization, Solr's Distributed Update Processor takes care of reordering the update operations based on their version number, and drops any operations that are duplicates or could cause inconsistency. One consequence is that the target cluster can send back to the source cluster identifiers that no longer exist. However, given that the identifier is an incremental version number, the update checkpoint on the source cluster can be set to the next existing version number without introducing inconsistency.

Replication Between Clusters with Different Topology

The current design can also work in scenarios where replication is performed between clusters with a different topology, e.g., one source cluster with two shards and a target cluster with four shards. However, there is one limitation due to a clock skew (version number) problem across shards.

In such a scenario, a target shard can receive updates from multiple source shards (as the document ids will be redistributed across shards due to the different cluster topology). This means that the version number generated by the source cluster must be global to the cluster in order to keep a partial ordering of the updates. However, the version number is local to a shard. Given that it is likely to have a clock skew across shards, a target shard will receive updates with duplicate or non-ordered version numbers. This does not really cause problems for add and delete-by-id operations, since the local version number replicated to the target cluster will be able to keep a partial ordering for a given document identifier. However, this causes issues for delete-by-query operations: when a cluster receives a delete-by-query, it is forwarded to each shard leader. Each shard leader will assign a version number (which can end up being different between shard leaders) to its delete-by-query and replicate the delete-by-query to all the target shard leaders.

Because the delete-by-query is replicated with different version numbers, it will not be possible to deduplicate and reorder these delete-by-query commands properly.

One way to solve the delete-by-query problem would be to have a clock synchronization procedure when a delete-by-query is received, which would happen before the leader forwards the delete-by-query to the other leaders. The workflow would look like the following:

- A leader receives a delete-by-query.
- This (primary) leader requests a clock synchronization across the cluster (i.e., among the other leaders). The clock is synchronized by using the highest version number across all the leaders.
- At this stage, the primary leader can assign a version number to the delete-by-query and forward it to the other leaders.
- The secondary leaders do not overwrite but reuse the version number attached to the delete-by-query. The delete-by-query command will therefore have the same version number across all the leaders.

When the leaders replicate the commands to the target Data Center, it then becomes possible to keep the partial ordering, since the source leaders have been synchronized and the delete-by-query commands all have the same version. Therefore, the problem boils down to how to implement a clock synchronization procedure.

Here is an initial proposal for a future option. Given that a synchronization will be done rarely (only in the case of a delete-by-query), performance might not be critical for its implementation. A possible solution would be a two-phase communication approach, where the primary leader initiates the clock synchronization protocol and requests the secondary leaders to:

- block/buffer updates
- send their latest version number to the primary leader
- await the answer of the primary leader with the new clock
- synchronize their clock

This is far from perfect, as things might become tricky if there are network or communication problems, but it is an initial idea to start the discussion. A schematic sketch of this proposal follows.
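Below is a schematic sketch of this two-phase proposal, not existing Solr code; LeaderPeer is a hypothetical handle on a secondary shard leader.

import java.util.List;

/** Sketch of the proposed two-phase clock synchronization for delete-by-query. */
public class DeleteByQueryClockSyncSketch {
    interface LeaderPeer {
        void blockUpdates();                       // phase 1: buffer incoming updates
        long latestVersion();                      // report the local clock
        void setClockAndUnblock(long newClock);    // phase 2: adopt the agreed clock
        void applyDeleteByQuery(String query, long version);
    }

    void handleDeleteByQuery(String query, long localVersion, List<LeaderPeer> secondaries) {
        long newClock = localVersion;
        for (LeaderPeer peer : secondaries) {       // phase 1: collect the highest clock
            peer.blockUpdates();
            newClock = Math.max(newClock, peer.latestVersion());
        }
        for (LeaderPeer peer : secondaries) {       // phase 2: distribute the agreed clock
            peer.setClockAndUnblock(newClock);
            // every leader attaches the SAME version to the delete-by-query
            peer.applyDeleteByQuery(query, newClock);
        }
    }
}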

Replication of Version Number

In Solr, version numbers are information local to a cluster and are automatically generated by the Distributed Update Processor for each update request. It is currently not possible to force Solr to use a predefined version number. To implement the above strategy, we will have to extend Solr so that version numbers generated by the source cluster and provided as part of the update request are processed and reused internally by the target cluster.

Maintenance of Updates Log

The CDCR replication logic requires modification to the maintenance logic of the Updates Log on the source Data Center. Initially, the Updates Log acts as a fixed-size queue, limited to 100 update entries. In the CDCR scenario, the Updates Log must act as a queue of variable size, as it needs to keep track of all the updates up through the last update processed by the target Data Center. Entries in the Updates Log are removed only when all pointers (one pointer per target Data Center) are past them. If the communication with one of the target Data Centers is slow, the Updates Log on the source Data Center can grow to a substantial size. In such a scenario, it is necessary for the Updates Log to be able to efficiently find a given update operation given its identifier. Given that the identifier is an incremental number, it is possible to implement an efficient search strategy.
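As a schematic illustration of the Updates Log retention rule described above (an entry may be pruned only once every target Data Center's pointer has moved past it), and not of Solr's actual CdcrUpdateLog implementation:

import java.util.Collection;
import java.util.Collections;

/** Sketch of the Updates Log retention rule; checkpoints are per-target pointers. */
public class UpdatesLogPruningSketch {
    /** Lowest checkpoint across all targets; entries at or below it are prunable. */
    static long pruneBoundary(Collection<Long> checkpointsPerTarget) {
        return checkpointsPerTarget.isEmpty()
                ? Long.MIN_VALUE                    // conservative: nothing prunable yet
                : Collections.min(checkpointsPerTarget);
    }

    static boolean canPrune(long entryVersion, Collection<Long> checkpointsPerTarget) {
        return entryVersion <= pruneBoundary(checkpointsPerTarget);
    }
}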

Monitoring

CDCR will provide the following monitoring capabilities over the replication operations:

- Monitoring of the outgoing and incoming replications, with information such as the source and target nodes, their status, etc.
- Statistics about the replication, with information such as operations (add/delete) per second, number of documents in the queue, etc.

Information about the lifecycle and statistics will be provided on a per-shard basis by the CDC Replicator thread. The CDCR API can then aggregate this information at the cluster level.

CDC Replicator

The CDC Replicator is a background thread responsible for replicating updates from a source Data Center to one or more target Data Centers. It is also responsible for providing monitoring information on a per-shard basis. As there can be a large number of collections and shards in a cluster, a fixed-size pool of CDC Replicator threads is shared across shards.

Limitations

The current design of CDCR has some limitations. CDCR will continue to evolve over time and many of these limitations will be addressed. Among them are:

- CDCR is unlikely to be satisfactory for bulk-load situations where the update rate is high. In this scenario, the initial bulk load should be performed, the Source and Target Data Centers synchronized, and CDCR then used for incremental updates.
- CDCR is currently only active-passive; data is pushed from the Source cluster to the Target cluster. There is active work being done in this area in the 6x code line to remove this limitation.

Configuration

The source and target configurations differ in the case of the Data Centers being in separate clusters. "Cluster" here means separate Zookeeper ensembles controlling disjoint Solr instances. Whether these Data Centers are physically separated or not is immaterial for this discussion.

Source configuration

Here is a sample of a source configuration file, a section in solrconfig.xml. The presence of the <replica> section causes CDCR to use this cluster as the source, and this section should not be present in the target collections in the cluster-to-cluster case.

Details about each setting are given after the two examples.

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">:2181</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>

Target configuration

Here is a typical two-data-center target configuration. The target instance must configure an update processor chain that is specific to CDCR. The update processor chain must include the CdcrUpdateProcessorFactory. The task of this processor is to ensure that the version numbers attached to update requests coming from a CDCR source SolrCloud are reused and not overwritten by the target. A properly configured target configuration looks similar to this:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>

Configuration details

The configuration details, defaults, and options are as follows:

The replica element

CDCR can be configured to forward update requests to one or more replicas. A replica is defined with a "replica" list as follows:

Parameter  Required  Default  Description
zkHost     Yes       none     The host address for ZooKeeper of the target SolrCloud. Usually this is a comma-separated list of addresses to each node in the target ZooKeeper ensemble.
source     Yes       none     The name of the collection on the source SolrCloud to be replicated.
target     Yes       none     The name of the collection on the target SolrCloud to which updates will be forwarded.

The replicator element

The CDC Replicator is the component in charge of forwarding updates to the replicas. The replicator monitors the update logs of the source collection and forwards any new updates to the target collection. The replicator uses a fixed thread pool to forward updates to multiple replicas in parallel. If more than one replica is configured, one thread forwards a batch of updates from one replica at a time, in a round-robin fashion. The replicator can be configured with a "replicator" list as follows:

Parameter       Required  Default  Description
threadPoolSize  No        2        The number of threads to use for forwarding updates. One thread per replica is recommended.
schedule        No        10       The delay in milliseconds for monitoring the update log(s).
batchSize       No        128      The number of updates to send in one batch. The optimal size depends on the size of the documents. Large batches of large documents can increase your memory usage significantly.

The updateLogSynchronizer element

Expert: Non-leader nodes need to synchronize their update logs with their leader node from time to time in order to clean up deprecated transaction log files. By default, this synchronization process is performed every minute. The schedule of the synchronization can be modified with an "updateLogSynchronizer" list as follows:

Parameter  Required  Default  Description
schedule   No        60000    The delay in milliseconds for synchronizing the updates log.

The buffer element

CDCR is configured by default to buffer any new incoming updates. When buffering updates, the updates log will store all the updates indefinitely. Replicas do not need to buffer updates, and it is recommended to disable buffering on the target SolrCloud.

The buffer can be disabled at startup with a "buffer" list and the parameter defaultState, as follows:

Parameter     Required  Default  Description
defaultState  No        enabled  The state of the buffer at startup.

CDCR API

The CDCR API is used to control and monitor the replication process. Control actions are performed at the collection level, i.e., by using a base URL of the form collection/cdcr for API calls. Monitor actions are performed at the core level, i.e., by using a base URL of the form core/cdcr for API calls. Currently, none of the CDCR API calls have any parameters.

API Entry points (control)

- collection/cdcr?action=status: Returns the current state of CDCR, including the state of the replication process and of the buffer.
- collection/cdcr?action=start: Starts CDCR replication.
- collection/cdcr?action=stopped: Stops CDCR replication.
- collection/cdcr?action=enablebuffer: Enables the buffering of updates.

- collection/cdcr?action=disablebuffer: Disables the buffering of updates.

API Entry points (monitoring)

- core/cdcr?action=queues: Fetches statistics about the queue for each replica and about the update logs.
- core/cdcr?action=ops: Fetches statistics about the replication performance (operations per second) for each replica.
- core/cdcr?action=errors: Fetches statistics and other information about replication errors for each replica.

Control commands

/collection/cdcr?action=status

Input

Query Parameters: There are no parameters to this command.

Output

Output Content: The current state of CDCR, which includes the state of the replication process and the state of the buffer.

Examples

Input: There are no parameters to this command.

Output

{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "status": {
    "process": "stopped",
    "buffer": "enabled"
  }
}

/collection/cdcr?action=enablebuffer

Input

Query Parameters: There are no parameters to this command.

Output

Output Content: The status of the process and an indication of whether the buffer is enabled.

Examples

Input: This command enables the buffer; there are no parameters.

Output

{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "status": {
    "process": "started",
    "buffer": "enabled"
  }
}

/collection/cdcr?action=disablebuffer

Input

Query Parameters: There are no parameters to this command.

Output

Output Content: The status of CDCR and an indication that the buffer is disabled.

Examples

Input: This command disables buffering; there are no parameters.

Output

{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "status": {
    "process": "started",
    "buffer": "disabled"
  }
}

/collection/cdcr?action=start

Input

Query Parameters: There are no parameters for this action.

Output

Output Content: Confirmation that CDCR is started and the status of buffering.

Examples

Input

Output

{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "status": {
    "process": "started",
    "buffer": "enabled"
  }
}

/collection/cdcr?action=stopped

Input

Query Parameters: There are no parameters for this command.

Output

Output Content: The status of CDCR, including confirmation that CDCR is stopped.

Examples

Input

Output

{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "status": {
    "process": "stopped",
    "buffer": "enabled"
  }
}
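As a hedged illustration of invoking the control commands above from client code (the host, port, and collection name are placeholders, not values from this document), a minimal Java sketch using the JDK HTTP client might look like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CdcrControlExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical base URL of a node in the source cluster hosting "collection1".
        String base = "http://source-node:8983/solr/collection1/cdcr?action=";
        HttpClient http = HttpClient.newHttpClient();
        for (String action : new String[] {"start", "status"}) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(base + action)).GET().build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            // The body is the JSON response shown in the examples above.
            System.out.println(action + " -> " + resp.body());
        }
    }
}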

Monitoring commands

/core/cdcr?action=queues

Input

Query Parameters: There are no parameters for this command.

Output

Output Content: The output is composed of a list "queues" which contains a list of (Zookeeper) target hosts, themselves containing a list of target collections. For each collection, the current size of the queue and the timestamp of the last update operation successfully processed are provided. The timestamp of the update operation is the original timestamp, i.e., the time the operation was processed on the source SolrCloud. This allows an estimate of the latency of the replication process. The "queues" object also contains information about the updates log, such as the size (in bytes) of the updates log on disk ("tlogTotalSize"), the number of transaction log files ("tlogTotalCount"), and the status of the updates log synchronizer ("updateLogSynchronizer").

Examples

Input

Output

{
  responseHeader={
    status=0,
    QTime=1},
  queues={
    ...:40342/solr={
      target_collection={
        queueSize=104,
        lastTimestamp=...T10:32:...Z}}},
  tlogTotalSize=3817,
  tlogTotalCount=1,
  updateLogSynchronizer=stopped}

/core/cdcr?action=ops

Input

Query Parameters: There are no parameters for this command.

Output

Output Content: The output is composed of a list "operationsPerSecond" which contains a list of (Zookeeper) target hosts, themselves containing a list of target collections. For each collection, the average number of processed operations per second since the start of the replication process is provided. The operations are further broken down into two groups: add and delete operations.

Examples

Input

Output

{
  responseHeader={
    status=0,
    QTime=1},
  operationsPerSecond={
    ...:59661/solr={
      target_collection={
        all=...,
        adds=...,
        deletes=0.0}}}}

/core/cdcr?action=errors

Input

Query Parameters: There are no parameters for this command.

Output

Output Content: The output is composed of a list "errors" which contains a list of (Zookeeper) target hosts, themselves containing a list of target collections. For each collection, information about errors encountered during the replication is provided, such as the number of consecutive errors encountered by the replicator thread, the number of bad request or internal errors since the start of the replication process, and a list of the last errors encountered, ordered by timestamp.

Examples

Input

Output

{
  responseHeader={
    status=0,
    QTime=2},
  errors={
    ...:36872/solr={
      target_collection={
        consecutiveErrors=3,
        bad_request=0,
        internal=3,
        last={
          ...T11:04:...Z=internal,
          ...T11:04:...Z=internal,
          ...T11:04:38.22Z=internal}}}}}

Starting CDCR the first time with an existing index

This is a general approach for initializing CDCR in a production environment, based upon the approach taken by the initial working installation of CDCR:

- The customer uses CDCR to keep a remote DR instance available for production backup. This is an active-passive solution.
- The customer has 26 clouds with 200 million assets per cloud (15 GB indexes). The total is over 4.8 billion documents.
- Source and target clouds were synched in 2-3 hour maintenance windows to establish the base index for the targets.

Tip: As usual, it is good to start small. Sync a single cloud and monitor it for a period of time before doing the others. You may need to adjust your settings several times before finding the right balance.

- Before starting, stop or pause the indexers. This is best done during a small maintenance window.
- Stop the Solr cloud instances at the source.
- Include the CDCR request handler configuration in solrconfig.xml:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">${targetzk}</str>
    <str name="source">${sourcecollection}</str>
    <str name="target">${targetcollection}</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">10</str>
    <str name="batchSize">2000</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

- Upload the modified solrconfig.xml to Zookeeper on both source and target.
- Sync the index directories from the source collection to the target collection across the corresponding shard nodes. Tip: rsync works well for this. For example, if there are 2 shards on collection1 with 2 replicas for each shard, copy the corresponding index directories as follows:

shard1 replica1 (source)  ->  shard1 replica1 (target)
shard1 replica2 (source)  ->  shard1 replica2 (target)
shard2 replica1 (source)  ->  shard2 replica1 (target)
shard2 replica2 (source)  ->  shard2 replica2 (target)

- Start the Zookeeper on the target (DR) side.
- Start the Solr cloud on the target (DR) side.
- Start the Zookeeper on the source side.
- Start the Solr cloud on the source side. Tip: As a general rule, the target (DR) side of the Solr cloud should be started before the source side.
- Activate CDCR on the source instance using the CDCR API.

- There is no need to run the /cdcr?action=start command on the target.
- Disable the buffer on the target.
- Re-enable indexing.

Monitoring

1. Network and disk space monitoring are essential. Ensure that the system has plenty of available storage to queue up changes if there is a disconnect between the source and target. A network outage between the two data centers can cause the update queue on your disks to grow.
   a. Tip: Set a monitor for your disks to send alerts when disk usage goes over a certain percentage (e.g., 70%).
   b. Tip: Run a test. With moderate indexing, how long can the system queue changes before you run out of disk space?
2. Create a simple way to check the counts between the source and the target (see the sketch below).
   a. Keep in mind that if indexing is running, the source and target may not match document for document. Set an alert to fire if the difference is greater than some percentage of the overall cloud size.
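A minimal sketch of such a count check is shown below; the node URLs, collection name, and the 1% alert threshold are placeholder assumptions, not values from this document.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Compares document counts between the source and target collections. */
public class CountCheck {
    static final Pattern NUM_FOUND = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)");

    static long count(HttpClient http, String collectionUrl) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(collectionUrl + "/select?q=*:*&rows=0&wt=json")).GET().build();
        String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = NUM_FOUND.matcher(body);
        return m.find() ? Long.parseLong(m.group(1)) : -1;   // -1 if the response is unexpected
    }

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        long source = count(http, "http://source-node:8983/solr/collection1");
        long target = count(http, "http://target-node:8983/solr/collection1");
        double driftPct = 100.0 * Math.abs(source - target) / Math.max(1, source);
        System.out.printf("source=%d target=%d drift=%.2f%%%n", source, target, driftPct);
        if (driftPct > 1.0) {
            System.out.println("ALERT: source/target counts differ by more than 1%");
        }
    }
}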

Zookeeper settings

1. With CDCR, the target Zookeepers will have connections from both the target clouds and the source clouds. You may need to increase the maxClientCnxns setting in zoo.cfg:

## raise the number of connections allowed from a single client
## (maxClientCnxns=0 means no limit)
maxClientCnxns=800

Upgrading and Patching Production

1. When rolling out upgrades to your indexer or application, you should shut down the source (production) and the target (DR). Depending on your setup, you may want to pause/stop indexing. Deploy the release or patch, re-enable indexing, and then start the target (DR).
   a. Tip: There is no need to reissue the DISABLEBUFFER or START commands. These are persisted.
   b. Tip: After starting the target, run a simple test. Add a test document to each of the source clouds, then check for it on the target.

# send to the source
curl -H 'Content-type:application/json' -d '[{"SKU":"ABC"}]'

# check the target
curl "..."


NoSQL Databases Analysis NoSQL Databases Analysis Jeffrey Young Intro I chose to investigate Redis, MongoDB, and Neo4j. I chose Redis because I always read about Redis use and its extreme popularity yet I know little about it.

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

Identity Firewall. About the Identity Firewall

Identity Firewall. About the Identity Firewall This chapter describes how to configure the ASA for the. About the, on page 1 Guidelines for the, on page 7 Prerequisites for the, on page 9 Configure the, on page 10 Monitoring the, on page 16 History

More information

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

Managing IoT and Time Series Data with Amazon ElastiCache for Redis Managing IoT and Time Series Data with ElastiCache for Redis Darin Briskman, ElastiCache Developer Outreach Michael Labib, Specialist Solutions Architect 2016, Web Services, Inc. or its Affiliates. All

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

System Description. System Architecture. System Architecture, page 1 Deployment Environment, page 4

System Description. System Architecture. System Architecture, page 1 Deployment Environment, page 4 System Architecture, page 1 Deployment Environment, page 4 System Architecture The diagram below illustrates the high-level architecture of a typical Prime Home deployment. Figure 1: High Level Architecture

More information

What's in this guide... 4 Documents related to NetBackup in highly available environments... 5

What's in this guide... 4 Documents related to NetBackup in highly available environments... 5 Contents Chapter 1 About in this guide... 4 What's in this guide... 4 Documents related to NetBackup in highly available environments... 5 Chapter 2 NetBackup protection against single points of failure...

More information

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path. Hashing B+-tree is perfect, but... Selection Queries to answer a selection query (ssn=) needs to traverse a full path. In practice, 3-4 block accesses (depending on the height of the tree, buffering) Any

More information

CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems CSIT5300: Advanced Database Systems L08: B + -trees and Dynamic Hashing Dr. Kenneth LEUNG Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong SAR,

More information

MySQL Cluster Web Scalability, % Availability. Andrew

MySQL Cluster Web Scalability, % Availability. Andrew MySQL Cluster Web Scalability, 99.999% Availability Andrew Morgan @andrewmorgan www.clusterdb.com Safe Harbour Statement The following is intended to outline our general product direction. It is intended

More information

Reference Architecture

Reference Architecture vrealize Automation 7.0.1 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions

More information

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB AlwaysOn Availability Groups: Backups, Restores, and CHECKDB www.brentozar.com sp_blitz sp_blitzfirst email newsletter videos SQL Critical Care 2016 Brent Ozar Unlimited. All rights reserved. 1 What I

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

VoltDB vs. Redis Benchmark

VoltDB vs. Redis Benchmark Volt vs. Redis Benchmark Motivation and Goals of this Evaluation Compare the performance of several distributed databases that can be used for state storage in some of our applications Low latency is expected

More information

Scribe Insight 6.5. Release Overview and Technical Information Version 1.0 April 7,

Scribe Insight 6.5. Release Overview and Technical Information Version 1.0 April 7, Scribe Insight 6.5 Release Overview and Technical Information Version 1.0 April 7, 2009 www.scribesoft.com Contents What is Scribe Insight?... 3 Release Overview... 3 Product Management Process Customer

More information

Performance and Scalability with Griddable.io

Performance and Scalability with Griddable.io Performance and Scalability with Griddable.io Executive summary Griddable.io is an industry-leading timeline-consistent synchronized data integration grid across a range of source and target data systems.

More information

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure Mario Beck (mario.beck@oracle.com) Principal Sales Consultant MySQL Session Agenda Requirements for

More information

How to Scale MongoDB. Apr

How to Scale MongoDB. Apr How to Scale MongoDB Apr-24-2018 About me Location: Skopje, Republic of Macedonia Education: MSc, Software Engineering Experience: Lead Database Consultant (since 2016) Database Consultant (2012-2016)

More information

Key-value store with eventual consistency without trusting individual nodes

Key-value store with eventual consistency without trusting individual nodes basementdb Key-value store with eventual consistency without trusting individual nodes https://github.com/spferical/basementdb 1. Abstract basementdb is an eventually-consistent key-value store, composed

More information

MarkLogic Server. Database Replication Guide. MarkLogic 6 September, Copyright 2012 MarkLogic Corporation. All rights reserved.

MarkLogic Server. Database Replication Guide. MarkLogic 6 September, Copyright 2012 MarkLogic Corporation. All rights reserved. Database Replication Guide 1 MarkLogic 6 September, 2012 Last Revised: 6.0-1, September, 2012 Copyright 2012 MarkLogic Corporation. All rights reserved. Database Replication Guide 1.0 Database Replication

More information

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Deep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services

Deep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services Deep Dive Amazon Kinesis Ian Meyers, Principal Solution Architect - Amazon Web Services Analytics Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure

More information