Introduction
Cassandra (created at Facebook for inbox search), like HBase, is a NoSQL database; generally, this means you cannot manipulate it with SQL. However, Cassandra implements CQL (Cassandra Query Language), whose syntax is clearly modeled on SQL and which is designed to manage extremely large data sets. It is a distributed database: clients can connect to any node in the cluster and access any data.
Besides Cassandra, other quite popular NoSQL databases include Apache HBase and MongoDB.
The primary container of data is a keyspace, which is like a database in an RDBMS. Inside a keyspace are one or more column families, which are like relational tables but more fluid and dynamic in structure. Column families can have from one to many thousands of columns, and both primary and secondary indexes on columns are supported.
In Cassandra, objects are created, data is inserted and manipulated, and information queried via CQL – the Cassandra Query Language, which looks nearly identical to SQL. Developers coming from the relational world will be right at home with CQL and will use standard commands (e.g., INSERT, SELECT) to interact with objects and data stored in Cassandra.
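For example, here is a minimal sketch of CQL's SQL-like syntax (a simplified version of the emp table used later in this post):

CREATE TABLE emp (empno int PRIMARY KEY, ename text, job text);
INSERT INTO emp (empno, ename, job) VALUES (7369, 'SMITH', 'CLERK');
SELECT ename, job FROM emp WHERE empno = 7369;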
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra uses a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
NoSQL Database
A NoSQL database (sometimes called "Not Only SQL") is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are generally schema-free, support easy replication, have simple APIs, are eventually consistent, and can handle huge amounts of data.
Relational Database | NoSQL Database
---|---
Supports a powerful query language. | Supports a very simple query language.
Has a fixed schema. | Has no fixed schema.
Follows ACID (Atomicity, Consistency, Isolation, Durability). | Is only "eventually consistent".
Supports transactions. | Does not support transactions.
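In Cassandra's case, eventual consistency is tunable per operation rather than fixed. In cqlsh (introduced later in this post), for example, you can require a quorum of replicas to acknowledge each statement:

cqlsh> CONSISTENCY QUORUM;
Consistency level set to QUORUM.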
Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below are some of the features of Cassandra:
Elastic scalability - Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as required.
Always on architecture - Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage - Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
Easy data distribution - Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.
Transaction support - Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID) at the row level.
Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.
Cassandra - Architecture
Components of Cassandra
The key components of Cassandra are as follows:
Node − It is the place where data is stored.
Data center − It is a collection of related nodes.
Cluster − A cluster is a component that contains one or more data centers.
Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables (a flush example follows this list).
SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Bloom filter − Bloom filters are quick, nondeterministic algorithms for testing whether an element is a member of a set. They act as a special kind of cache and are accessed after every query.
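To see the mem-table/SSTable relationship in practice, you can force a flush to disk with nodetool; the keyspace and table names here (scott, emp) are the ones created later in this post:

[cass@dn01 ~]$ nodetool flush scott emp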
Cassandra Query Language
Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through language drivers in their applications.
Write Operations
Clients can send read and write operations to any node. That node, the coordinator, acts as a proxy between the client and the nodes holding the data.
Every write is first captured by the commit log on the node. The data is then written to the mem-table. Whenever the mem-table is full, its data is flushed to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically compacts the SSTables, discarding unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.
Cassandra - Data Model
Snitches
A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places replicas based on the information provided by the snitch. All nodes must use the same snitch configuration so that they report consistent rack and data center information. Cassandra does its best not to place more than one replica on the same rack (which is not necessarily a physical grouping).
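The snitch is set in cassandra.yaml; a common production choice is GossipingPropertyFileSnitch, which reads each node's data center and rack from cassandra-rackdc.properties (the dc and rack values below are illustrative):

endpoint_snitch: GossipingPropertyFileSnitch

# conf/cassandra-rackdc.properties
dc=dc1
rack=rack1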
Cluster: Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to them.
Keyspace:
Keyspace is the outermost container for data in Cassandra. The basic attributes of a keyspace in Cassandra are:
Replication factor − It is the number of machines in the cluster that will receive copies of the same data.
Replica placement strategy − It is nothing but the strategy to place replicas in the ring. The available strategies are SimpleStrategy (rack-unaware), OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (data center-aware); see the example after this list.
Column families − Keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
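For example, a keyspace replicated across two data centers would use NetworkTopologyStrategy (the keyspace and data center names here are illustrative):

CREATE KEYSPACE sales
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};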
Column
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp. Given below is the structure of a column.
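A conceptual sketch of a column (the names and values are illustrative, borrowed from the emp example later in this post):

{
  name: "ename",
  value: "SMITH",
  timestamp: 1498000000000
}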
SuperColumn
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Generally column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. Given below is the structure of a super column.
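A conceptual sketch of a super column (the names and values are illustrative):

{
  name: "phonenum",
  value: {
    country_code: {name: "country_code", value: 1, timestamp: 1498000000000},
    number: {name: "number", value: "202 456-1111", timestamp: 1498000000000}
  }
}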
Installation
You can download Cassandra from the links below and install it. The steps that follow use the binary tarball; an RPM package installation (used later on dn04) is also available.
http://cassandra.apache.org/download/
http://www.apache.org/dyn/closer.lua/cassandra/3.11.0/apache-cassandra-3.11.0-bin.tar.gz
[hdpsysuser@dn01 ~]$ cd /usr/hadoopsw/
[hdpsysuser@dn01 ~]$ tar zxvf apache-cassandra-3.11.0-bin.tar.gz
1- Untar the file somewhere
[root@dn01 hadoopsw]# tar -xvf apache-cassandra-3.11.0-bin.tar.gz
2- Start Cassandra in the foreground by invoking bin/cassandra -f from the command line; press Control-C to stop it. Start Cassandra in the background by invoking bin/cassandra from the command line. To stop it, invoke kill <pid> or pkill -f CassandraDaemon, where <pid> is the Cassandra process id, which you can find, for example, by invoking pgrep -f CassandraDaemon.
[root@dn01 hadoopsw]# cassandra -f
Running Cassandra as root user or group is not recommended - please start Cassandra using a different system user.
If you really want to force running Cassandra as root, use -R command line option.
[root@dn01 hadoopsw]# useradd cass
[root@dn01 hadoopsw]# chown -R cass:cass /usr/hadoopsw/apache-cassandra-3.11.0
[root@dn01 hadoopsw]# su - cass
[cass@dn01 ~]$ cat ~/.bash_profile
# .bash_profile
#######Cassandra Variables##########
export CASSANDRA_HOME=/usr/hadoopsw/apache-cassandra-3.11.0
export PATH=$PATH:$CASSANDRA_HOME/bin
[cass@dn01 ~]$ source ~/.bash_profile
[cass@dn01 ~]$ cassandra -f
CTRL+C
[cass@dn01 ~]$ cassandra
3- Verify that Cassandra is running by invoking bin/nodetool status from the command line.
[cass@dn01 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 174.71 KiB 256 100.0% e4382ae1-33a0-4a1d-9451-62978b9833be rack1
4- Configuration files are located in the CASSANDRA_HOME/conf sub-directory. Since Cassandra 2.1, log and data directories are located in the CASSANDRA_HOME/logs and CASSANDRA_HOME/data sub-directories respectively.
Configure Cassandra
For running Cassandra on a single node, the steps above are enough; you don't really need to change any configuration. However, when you deploy a cluster of nodes, or use clients that are not on the same host, some parameters must be changed.
The Cassandra configuration files can be found in the conf directory of tarballs. For packages, the configuration files will be located in /etc/cassandra.
Main runtime properties
Most of the configuration in Cassandra is done via YAML properties that can be set in cassandra.yaml. At a minimum you should consider setting the following properties:
cluster_name: the name of your cluster.
seeds: a comma separated list of the IP addresses of your cluster seeds.
storage_port: you don’t necessarily need to change this but make sure that there are no firewalls blocking this port.
listen_address: the IP address of your node; this is what allows other nodes to communicate with this node, so it is important that you change it. Alternatively, you can set listen_interface to tell Cassandra which interface, and consequently which address, to use. Set only one, not both.
native_transport_port: as with storage_port, make sure this port is not blocked by firewalls, since clients communicate with Cassandra on this port. A sample cassandra.yaml snippet follows this list.
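For example, a minimal cassandra.yaml for the first node of the cluster used in this post might look like this (the addresses come from the nodetool output shown below; adjust them to your environment):

cluster_name: 'Test Cluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.49.135"
listen_address: 192.168.49.135
storage_port: 7000
native_transport_port: 9042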
Changing the location of directories
The following yaml properties control the location of directories:
data_file_directories: one or more directories where data files are located.
commitlog_directory: the directory where commitlog files are located.
saved_caches_directory: the directory where saved caches are located.
hints_directory: the directory where hints are located.
For performance reasons, if you have multiple disks, consider putting commitlog and data files on different disks.
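For example (the directory paths are illustrative):

data_file_directories:
    - /disk1/cassandra/data
commitlog_directory: /disk2/cassandra/commitlog
saved_caches_directory: /disk1/cassandra/saved_caches
hints_directory: /disk1/cassandra/hints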
You can repeat the above steps on the other nodes if your cluster spans more than one node.
[cass@dn02 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.49.136 175.47 KiB 256 100.0% 857f41ec-2fbc-456f-a349-437a7fee7e1f rack1
UN 192.168.49.135 337.98 KiB 256 100.0% e4382ae1-33a0-4a1d-9451-62978b9833be rack1
[cass@dn03 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.49.136 237.26 KiB 256 66.1% 857f41ec-2fbc-456f-a349-437a7fee7e1f rack1
UN 192.168.49.137 103.71 KiB 256 64.6% 5f9089b2-6d0a-4660-962a-9db81887b2fd rack1
UN 192.168.49.135 235.46 KiB 256 69.3% e4382ae1-33a0-4a1d-9451-62978b9833be rack1
JVM-level settings such as heap size can be set in cassandra-env.sh. You can add any additional JVM command line argument to the JVM_OPTS environment variable; when Cassandra starts these arguments will be passed to the JVM.
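For example, to pin the heap explicitly in cassandra-env.sh (the sizes are illustrative; by default Cassandra derives them from system memory, and MAX_HEAP_SIZE and HEAP_NEWSIZE must be set together):

MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
# any extra JVM flags can be appended to JVM_OPTS
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"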
Logging
The logger in use is logback. You can change logging properties by editing logback.xml. By default it logs at INFO level into a file called system.log and at DEBUG level into a file called debug.log. When running in the foreground, it also logs at INFO level to the console.
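For example, to raise the org.apache.cassandra logger to DEBUG you could add a standard logback logger element to logback.xml (a sketch):

<logger name="org.apache.cassandra" level="DEBUG"/>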
Internode communications (gossip)
In Cassandra internode communication is performed using Gossip which is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.
To prevent problems in gossip communications, use the same list of seed nodes for all nodes in a cluster. In multiple data-center clusters, the seed list should include at least one node from each data center (replication group). More than a single seed node per data center is recommended for fault tolerance. It is also recommended to keep the seed list small (approximately three nodes per data center).
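For the nodes used in this post, the seed list in cassandra.yaml (identical on every node) might look like:

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.49.135,192.168.49.136"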
Connect/working with Cassandra using cqlsh
Connect Locally
[cass@dn01 ~]$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> SELECT cluster_name, listen_address FROM system.local;
cluster_name | listen_address
--------------+----------------
Test Cluster | 192.168.49.135
(1 rows)
cqlsh> help
Connect Remotely
[cass@dn03 ~]$ cqlsh dn03 9042
Connection error: ('Unable to connect to any servers', {'192.168.49.138': error(111, "Tried connecting to [('192.168.49.138', 9042)]. Last error: Connection refused")})
[cass@dn03 ~]$ netstat -lnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:8010 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:9042 0.0.0.0:* LISTEN
tcp 0 0 192.168.122.1:53 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN
tcp 0 0 192.168.49.135:7000 0.0.0.0:* LISTEN
...
...
Change the below property value from localhost to the name of the node in cassandra.yaml and restart Cassandra on that node.
rpc_address: dn03
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:8010 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:54832 0.0.0.0:* LISTEN
tcp 0 0 192.168.49.137:9042 0.0.0.0:* LISTEN
tcp 0 0 192.168.122.1:53 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
[cass@dn01 ~]$ cqlsh dn03 9042
Connected to Test Cluster at dn03:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
cqlsh> capture '/tmp/cass_output.txt'
Now capturing query output to '/tmp/cass_output.txt'.
cqlsh> capture off;
-- Describe the current cluster of Cassandra and its objects
cqlsh> describe cluster;
Cluster: Test Cluster
Partitioner: Murmur3Partitioner
-- List all the keyspaces in a cluster
cqlsh> describe keyspaces;
system_traces system_schema system_auth system system_distributed
-- List all the tables in a keyspace
cqlsh> describe tables;
Keyspace system_traces
----------------------
events sessions
Keyspace system_schema
----------------------
tables triggers views keyspaces dropped_columns
functions aggregates indexes types columns
Keyspace system_auth
--------------------
resource_role_permissons_index role_permissions role_members roles
Keyspace system
---------------
available_ranges peers batchlog transferred_ranges
batches compaction_history size_estimates hints
prepared_statements sstable_activity built_views
"IndexInfo" peer_events range_xfers
views_builds_in_progress paxos local
Keyspace system_distributed
---------------------------
repair_history view_build_status parent_repair_history
-- Describe a Table
cqlsh> describe system_traces.sessions;
CREATE TABLE system_traces.sessions (
session_id uuid PRIMARY KEY,
client inet,
command text,
coordinator inet,
duration int,
parameters map<text, text>,
request text,
started_at timestamp
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = 'tracing sessions'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
-- Describe user-defined data types
cqlsh> describe types
cqlsh> describe type <typeName>
-- To expand the output on/off
cqlsh> expand on ;
Now Expanded output is enabled
cqlsh> expand off;
cqlsh> exit
cqlsh> show host
Connected to Test Cluster at 127.0.0.1:9042.
-- Execute the commands in a file
[cass@dn01 ~]$ vi /data/cass_input_file.cas
select * from system.local;
cqlsh> source '/data/cass_input_file.cas';
A keyspace in Cassandra is a namespace that defines data replication on nodes. A cluster can contain multiple keyspaces.
cqlsh> CREATE KEYSPACE scott
... WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
cqlsh> describe keyspaces;
system_schema system_auth system scott system_distributed system_traces
cqlsh> use scott;
cqlsh:scott>
cqlsh:scott> ALTER KEYSPACE scott
... WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
cqlsh:scott> drop keyspace test;
-- Table Operations
CREATE TABLE emp(
empno int PRIMARY KEY,
ename text,
job text,
mgr int,
hiredate text,
sal varint,
comm varint,
deptno int
);
cqlsh:scott> CREATE TABLE emp(
... empno int PRIMARY KEY,
... ename text,
... job text,
... mgr int,
... hiredate text,
... sal varint,
... comm varint,
... deptno int
... );
cqlsh:scott> DESCRIBE COLUMNFAMILIES;
emp
cqlsh:scott> select * from emp;
empno | comm | deptno | ename | hiredate | job | mgr | sal
-------+------+--------+-------+----------+-----+-----+-----
(0 rows)
INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno) values(7369,'SMITH','CLERK',7902,'17-DEC-80',800,null,20);
INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno) values(7499,'ALLEN','SALESMAN',7698,'20-FEB-81',1600,300,30);
INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno) values(7902,'FORD','ANALYST',7566,'03-DEC-81',3000,null,20);
cqlsh:scott> INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno
... ) values(7369,'SMITH','CLERK',7902,'17-DEC-80',800,null,20);
cqlsh:scott> INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno
... ) values(7499,'ALLEN','SALESMAN',7698,'20-FEB-81',1600,300,30);
cqlsh:scott> select * from emp;
empno | comm | deptno | ename | hiredate | job | mgr | sal
-------+------+--------+-------+-----------+----------+------+------
7499 | 300 | 30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
7369 | null | 20 | SMITH | 17-DEC-80 | CLERK | 7902 | 800
(2 rows)
cqlsh:scott> select empno, count(*) from emp group by empno;
empno | count
-------+-------
7499 | 1
7369 | 1
(2 rows)
Warnings :
Aggregation query used without partition key
cqlsh:scott> update emp set comm=100 where empno=7369; --update a column
cqlsh:scott> select * from emp;
empno | comm | deptno | ename | hiredate | job | mgr | sal
-------+------+--------+-------+-----------+----------+------+------
7499 | 300 | 30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
7369 | 100 | 20 | SMITH | 17-DEC-80 | CLERK | 7902 | 800
(2 rows)
cqlsh:scott> DELETE comm FROM emp WHERE empno=7369; --delete a column value
cqlsh:scott> select * from emp;
empno | comm | deptno | ename | hiredate | job | mgr | sal
-------+------+--------+-------+-----------+----------+------+------
7499 | 300 | 30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
7369 | null | 20 | SMITH | 17-DEC-80 | CLERK | 7902 | 800
(2 rows)
cqlsh:scott> delete from emp where empno=7369; --delete entire row
cqlsh:scott> select * from emp;
empno | comm | deptno | ename | hiredate | job | mgr | sal
-------+------+--------+-------+-----------+----------+------+------
7499 | 300 | 30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
(1 rows)
cqlsh:scott> truncate table emp;
cqlsh:scott> CREATE INDEX idx_ename ON emp (ename);
cqlsh:scott> drop index idx_ename;
cqlsh:scott> ALTER TABLE emp ADD email text;
cqlsh:scott> select * from emp;
empno | comm | deptno | email | ename | hiredate | job | mgr | sal
-------+------+--------+-------+-------+----------+-----+-----+-----
(0 rows)
cqlsh:scott> ALTER TABLE emp DROP email;
cqlsh:scott> drop table emp;
-- User defined type (UDT); the emp table was recreated and reloaded after the drop above before continuing
CREATE TYPE phone (
country_code int,
number text
);
cqlsh:scott> CREATE TYPE phone (
... country_code int,
... number text
... );
cqlsh:scott> describe types
phone
cqlsh:scott> ALTER TABLE emp ADD phonenum phone;
cqlsh:scott> select * from emp;
empno | comm | deptno | ename | hiredate | job | mgr | phonenum | sal
-------+------+--------+-------+-----------+----------+------+----------+------
7902 | null | 20 | FORD | 03-DEC-81 | ANALYST | 7566 | null | 3000
7499 | 300 | 30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | null | 1600
7369 | null | 20 | SMITH | 17-DEC-80 | CLERK | 7902 | null | 800
(3 rows)
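The statement that populated SMITH's phone number is not shown in the transcript; using CQL's UDT literal syntax, it would look something like:

cqlsh:scott> UPDATE emp SET phonenum = {country_code: 1, number: '202 456-1111'} WHERE empno=7369;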
cqlsh:scott> select empno,ename,phonenum from emp;
empno | ename | phonenum
-------+-------+-------------------------------------------
7902 | FORD | null
7499 | ALLEN | null
7369 | SMITH | {country_code: 1, number: '202 456-1111'}
-- Select data as JSON
cqlsh:scott> select json ename,job from emp;

 [json]
---------------------------------------
{"ename": "FORD", "job": "ANALYST"}
{"ename": "ALLEN", "job": "SALESMAN"}
{"ename": "SMITH", "job": "CLERK"}
(3 rows)
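CQL can also accept JSON on input (supported since Cassandra 2.2); for example, inserting a hypothetical new row:

cqlsh:scott> INSERT INTO emp JSON '{"empno": 7934, "ename": "MILLER", "job": "CLERK", "deptno": 10}';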
Using ODBC Driver
Download the Cassandra ODBC driver from the link below, then install and configure it for use in your desired application.
https://academy.datastax.com/downloads/download-drivers
The ODBC connection test failed; to investigate the reason, start with the ports Cassandra listens on.
Cassandra Ports
- 7199 - JMX (was 8080 pre Cassandra 0.8.xx)
- 7000 - Internode communication (not used if TLS enabled) (gossip/replication/proxied queries/etc)
- 7001 - TLS Internode communication (used if TLS enabled)
- 9160 - Thrift client API
- 9042 - CQL native transport port
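If a firewall is running on the nodes, these ports must be open between them; on RHEL/CentOS with firewalld, for example (a sketch, adjust to your setup):

[root@dn04 ~]# firewall-cmd --permanent --add-port=7000/tcp --add-port=9042/tcp
[root@dn04 ~]# firewall-cmd --reload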
[root@dn04 ~]# netstat -tanp | grep LISTEN
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 3256/Xvnc
tcp 0 0 127.0.0.1:9042 0.0.0.0:* LISTEN 20153/java
...
Port 9042 is listening only on localhost (127.0.0.1). Go to /etc/cassandra/default.conf/cassandra.yaml, find "rpc_address:" and change its value to dn04 (the name of the server where Cassandra is running), then restart the Cassandra service and try to establish the connection again.
[root@dn04 ~]# service cassandra restart
Restarting cassandra (via systemctl): [ OK ]
After the configuration changes, you will need to put the server name or IP while connecting with CQLSH.
[root@dn04 ~]# cqlsh dn04
cqlsh> show host
Connected to Test Cluster at dn04:9042.
How is data stored and read?
At a very high level, Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring. Nodes generally run on commodity hardware. Each node in the cluster is responsible for, and assigned, a token range. A token in Cassandra is a hash value.
When you insert data into Cassandra, it hashes the partition key of the row (the primary key as a whole consists of the partition key plus any clustering columns, but only the partition key determines the token). With the default Murmur3Partitioner used in this post, the token range is -2^63 to 2^63-1; the older RandomPartitioner used 0 to 2^127. Every node in a Cassandra cluster or "ring" is given an initial token. This initial token defines the end of the range a node is responsible for.
For example, consider a token range of 1 - 100. If you have 4 nodes in the Cassandra cluster, then each node will have an initial token: Node1 = 25, Node2 = 50, Node3 = 75 and Node4 = 100. So data with a hash value of 1 - 25 will be stored on Node1, data with a hash value of 26 - 50 on Node2, and so on.
Client read or write requests can go to any node in the cluster because all nodes in Cassandra are peers. When a client connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.
The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data being requested. The coordinator determines which nodes in the ring should get the request based on the cluster configured partitioner and replica placement strategy.
The coordinator node also has data about which nodes are responsible for each token range. You can see this information by running a nodetool ring from the command line.
[cass@dn01 ~]$ nodetool ring > /tmp/cassRingToken.txt
cqlsh:scott> select token(empno), empno,ename from emp;
system.token(empno) | empno | ename
----------------------+-------+-------
-8670174067668179189 | 7902 | FORD
-1144048224861957591 | 7499 | ALLEN
2617034212096716347 | 7369 | SMITH
(3 rows)
Search cassRingToken.txt for the relevant token to see which node is responsible for it, or, even easier, use nodetool getendpoints:
[cass@dn01 ~]$ nodetool getendpoints
nodetool: getendpoints requires keyspace, table and partition key arguments
See 'nodetool help' or 'nodetool help <command>'.
[cass@dn01 ~]$ nodetool getendpoints scott emp 7369
192.168.49.135
192.168.49.136