pycassa.columnfamily – Column Family

Provides an abstraction of Cassandra’s data model to allow for easy manipulation of data inside Cassandra.

pycassa.columnfamily.gm_timestamp()

Returns the number of microseconds since the Unix Epoch.
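
A minimal standard-library sketch of equivalent behavior (not pycassa’s exact source):

    import time

    def gm_timestamp():
        # Microseconds since the Unix Epoch, as an integer
        return int(time.time() * 1e6)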

class pycassa.columnfamily.ColumnFamily(pool, column_family)

An abstraction of a Cassandra column family or super column family. Operations on this, such as get() or insert(), will get data from or insert data into the corresponding Cassandra column family.

pool is a ConnectionPool that the column family will use for all operations. A connection is drawn from the pool before each operation and is returned afterwards.

column_family should be the name of the column family that you want to use in Cassandra. Note that the keyspace to be used is determined by the pool.
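
For example, with a hypothetical keyspace, server address, and column family name:

    >>> import pycassa
    >>> pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
    >>> cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

The examples for the attributes and methods below continue this session.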

read_consistency_level = 1

The default consistency level for every read operation, such as get() or get_range(). This may be overridden per-operation. This should be an instance of ConsistencyLevel. The default level is ONE.

write_consistency_level = 1

The default consistency level for every write operation, such as insert() or remove(). This may be overridden per-operation. This should be an instance of ConsistencyLevel. The default level is ONE.
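
For example, to make quorum reads and writes the default (ConsistencyLevel lives in pycassa’s Thrift-generated types, pycassa.cassandra.ttypes):

    >>> from pycassa.cassandra.ttypes import ConsistencyLevel
    >>> cf.read_consistency_level = ConsistencyLevel.QUORUM
    >>> cf.write_consistency_level = ConsistencyLevel.QUORUM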

autopack_names = True

Controls whether column names are automatically converted to or from their natural type to the binary string format that Cassandra uses. The data type used is controlled by column_name_class for column names and super_column_name_class for super column names. By default, this is True.

autopack_values = True

Whether column values are automatically converted to or from their natural type to the binary string format that Cassandra uses. The data type used is controlled by default_validation_class and column_validators. By default, this is True.

autopack_keys = True

Whether row keys are automatically converted to or from their natural type to the binary string format that Cassandra uses. The data type used is controlled by key_validation_class. By default, this is True.

column_name_class

The data type of column names, which pycassa will use to determine how to pack and unpack them.

This is set automatically by inspecting the column family’s comparator_type, but it may also be set manually if you want autopacking behavior without setting a comparator_type. Options include an instance of any class in pycassa.types, such as LongType().
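
For example, a sketch that packs and unpacks column names as longs (this assumes the column family’s names really are 8-byte longs):

    >>> from pycassa.types import LongType
    >>> cf.column_name_class = LongType()
    >>> cf.insert('key1', {42: 'val1'})  # 42 is packed to its binary form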

super_column_name_class

Like column_name_class, but for super column names.

default_validation_class

The default data type of column values, which pycassa will use to determine how to pack and unpack them.

This is set automatically by inspecting the column family’s default_validation_class, but it may also be set manually if you want autopacking behavior without setting a default_validation_class. Options include an instance of any class in pycassa.types, such as LongType().

column_validators

Like default_validation_class, but is a dict mapping individual columns to types.
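
For example, with hypothetical 'name' and 'age' columns:

    >>> from pycassa.types import UTF8Type, IntegerType
    >>> cf.column_validators['name'] = UTF8Type()
    >>> cf.column_validators['age'] = IntegerType()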

key_validation_class

The data type of row keys, which pycassa will use to determine how to pack and unpack them.

This is set automatically by inspecting the column family’s key_validation_class (which only exists in Cassandra 0.8 or greater), but may be set manually if you want the autopacking behavior without setting a key_validation_class or if you are using Cassandra 0.7. Options include an instance of any class in pycassa.types, such as LongType().

dict_class = <class 'collections.OrderedDict'>

Results are returned as dictionaries. By default, python 2.7’s collections.OrderedDict is used if available; otherwise, pycassa.util.OrderedDict is used so that column order is maintained. A different class, such as dict, may be used instead by setting this attribute.

buffer_size = 1024

When calling get_range() or get_indexed_slices(), the intermediate results need to be buffered if we are fetching many rows, otherwise performance may suffer and the Cassandra server may overallocate memory and fail. This is the size of that buffer in number of rows. The default is 1024.

column_buffer_size = 1024

The number of columns fetched at once for xget().

timestamp = <unbound method ColumnFamily.gm_timestamp>

Each insert() or remove() sends a timestamp with every column. This attribute is a function that is used to get this timestamp when needed. The default function is gm_timestamp().
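
For example, to supply timestamps from your own clock (my_clock_microseconds is a hypothetical zero-argument function returning an integer):

    >>> cf.timestamp = my_clock_microseconds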

load_schema()

Loads the schema definition for this column family from Cassandra and updates comparator and validation classes if necessary.

get(key[, columns][, column_start][, column_finish][, column_reversed][, column_count][, include_timestamp][, super_column][, read_consistency_level][, include_ttl])

Fetches all or part of the row with key key.

The columns fetched may be limited to a specified list of column names using columns.

Alternatively, you may fetch a slice of columns or super columns from a row using column_start, column_finish, and column_count. Setting these will cause columns or super columns to be fetched starting with column_start, continuing until column_count columns or super columns have been fetched or column_finish is reached. If column_start is left as the empty string, the slice will begin with the start of the row; leaving column_finish blank will cause the slice to extend to the end of the row. Note that column_count defaults to 100, so rows over this size will not be completely fetched by default.

If column_reversed is True, columns are fetched in reverse sorted order, beginning with column_start. In this case, if column_start is the empty string, the slice will begin with the end of the row.

You may fetch all or part of only a single super column by setting super_column. If this is set, column_start, column_finish, column_count, and column_reversed will apply to the subcolumns of super_column.

To include every column’s timestamp in the result set, set include_timestamp to True. Results will include a (value, timestamp) tuple for each column.

To include every column’s ttl in the result set, set include_ttl to True. Results will include a (value, ttl) tuple for each column.

If this is a standard column family, the return type is of the form {column_name: column_value}. If this is a super column family and super_column is not specified, the results are of the form {super_column_name: {column_name: column_value}}. If super_column is set, the super column name will be excluded and the results are of the form {column_name: column_value}.
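
A few typical calls, continuing the session above (keys, columns, and returned values are illustrative):

    >>> cf.get('key1')
    {'col1': 'val1', 'col2': 'val2'}
    >>> cf.get('key1', columns=['col1'])
    {'col1': 'val1'}
    >>> cf.get('key1', column_reversed=True, column_count=3)
    {'col2': 'val2', 'col1': 'val1'}
    >>> cf.get('key1', include_timestamp=True)
    {'col1': ('val1', 1354491238721387)}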

multiget(keys[, columns][, column_start][, column_finish][, column_reversed][, column_count][, include_timestamp][, super_column][, read_consistency_level][, buffer_size])

Fetch multiple rows from a Cassandra server.

keys should be a list of keys to fetch.

buffer_size is the number of rows from the total list to fetch at a time. If left as None, the ColumnFamily’s buffer_size will be used.

All other parameters are the same as get(), except that a list of keys may be passed in.

Results will be returned in the form: {key: {column_name: column_value}}. If an OrderedDict is used, the rows will have the same order as keys.
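
For example (keys and values are illustrative):

    >>> cf.multiget(['key1', 'key2'])
    {'key1': {'col1': 'val1'}, 'key2': {'col1': 'val2'}}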

xget(key[, column_start][, column_finish][, column_reversed][, column_count][, include_timestamp][, read_consistency_level][, buffer_size])

Like get(), but creates a generator that pages over the columns automatically.

The number of columns fetched at once can be controlled with the buffer_size parameter. The default is column_buffer_size.

The generator returns (name, value) tuples.
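
For example, to walk a very wide row without holding it all in memory (the row key and handle_column function are hypothetical):

    >>> for name, value in cf.xget('wide_row', buffer_size=2048):
    ...     handle_column(name, value)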

get_count(key[, super_column][, columns][, column_start][, column_finish][, read_consistency_level][, column_reversed][, max_count])

Count the number of columns in the row with key key.

You may limit the columns or super columns counted to those in columns. Additionally, you may limit the columns or super columns counted to only those between column_start and column_finish.

You may also count only the number of subcolumns in a single super column using super_column. If this is set, columns, column_start, and column_finish only apply to the subcolumns of super_column.

To put an upper bound on the number of columns that are counted, set max_count.
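
For example (counts are illustrative):

    >>> cf.get_count('key1')
    3
    >>> cf.get_count('key1', column_start='a', column_finish='m')
    2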

multiget_count(keys[, super_column][, columns][, column_start][, column_finish][, read_consistency_level][, buffer_size][, column_reversed][, max_count])

Perform a column count in parallel on a set of rows.

The parameters are the same as for get_count(), except that a list of keys may be passed in. A dictionary of the form {key: int} is returned.

buffer_size is the number of rows from the total list to count at a time. If left as None, the ColumnFamily’s buffer_size will be used.

To put an upper bound on the number of columns that are counted, set max_count.
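
For example (counts are illustrative):

    >>> cf.multiget_count(['key1', 'key2', 'key3'])
    {'key1': 3, 'key2': 1, 'key3': 2}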

get_range([start][, finish][, columns][, column_start][, column_finish][, column_reversed][, column_count][, row_count][, include_timestamp][, super_column][, read_consistency_level][, buffer_size][, filter_empty])

Get an iterator over rows in a specified key range.

The key range begins with start and ends with finish. If left as empty strings, these extend to the beginning and end, respectively. Note that if RandomPartitioner is used, rows are stored in the order of the MD5 hash of their keys, so getting a lexicographical range of keys is not feasible.

In place of start and finish, you may use start_token and finish_token or a combination of start and finish_token. In this case, you are specifying a token range to fetch instead of a key range. This can be useful for fetching all data owned by a node or for parallelizing a full data set scan. Otherwise, you should typically just use start and finish. When using RandomPartitioner or Murmur3Partitioner, start_token and finish_token should be string versions of the numeric tokens; for ByteOrderedPartitioner, they should be hex-encoded string versions of the token.

The row_count parameter limits the total number of rows that may be returned. If left as None (the default), the number of rows that may be returned is unlimited.

When calling get_range(), the intermediate results need to be buffered if we are fetching many rows, otherwise the Cassandra server will overallocate memory and fail. buffer_size is the size of that buffer in number of rows. If left as None, the ColumnFamily’s buffer_size attribute will be used.

When filter_empty is left as True, empty rows (including range ghosts) will be skipped and will not count towards row_count.

All other parameters are the same as those of get().

A generator over (key, {column_name: column_value}) is returned. To convert this to a list, use list() on the result.
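
For example, to scan the first 100 rows (python 2 syntax, matching the rest of these docs):

    >>> for key, columns in cf.get_range(row_count=100):
    ...     print key, columns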

get_indexed_slices(index_clause[, columns][, column_start][, column_finish][, column_reversed][, column_count][, include_timestamp][, read_consistency_level][, buffer_size])

Similar to get_range(), but an IndexClause is used instead of a key range.

index_clause limits the keys that are returned based on expressions that compare the value of a column to a given value. At least one of the expressions in the IndexClause must be on an indexed column.

Note that Cassandra does not support secondary indexes or get_indexed_slices() for super column families.
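
A sketch using the helpers in pycassa.index; this assumes a 'state' column that has a secondary index:

    >>> from pycassa.index import create_index_expression, create_index_clause
    >>> expr = create_index_expression('state', 'TX')
    >>> clause = create_index_clause([expr], count=1000)
    >>> for key, columns in cf.get_indexed_slices(clause):
    ...     print key, columns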

insert(key, columns[, timestamp][, ttl][, write_consistency_level])

Insert or update columns in the row with key key.

columns should be a dictionary of columns or super columns to insert or update. If this is a standard column family, columns should look like {column_name: column_value}. If this is a super column family, columns should look like {super_column_name: {sub_column_name: value}}. If this is a counter column family, you may use integers as values and those will be used as counter adjustments.

A timestamp may be supplied for all inserted columns with timestamp.

ttl sets the “time to live” in number of seconds for the inserted columns. After this many seconds, Cassandra will mark the columns as deleted.

The timestamp Cassandra reports as being used for insert is returned.
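
For example (the returned timestamp is illustrative; super_cf is a hypothetical super column family):

    >>> cf.insert('key1', {'col1': 'val1', 'col2': 'val2'})
    1354491238721387
    >>> cf.insert('key2', {'col1': 'val1'}, ttl=3600)  # expires in one hour
    >>> super_cf.insert('key1', {'super1': {'sub1': 'val1'}})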

batch_insert(rows[, timestamp][, ttl][, write_consistency_level])

Like insert(), but multiple rows may be inserted at once.

The rows parameter should be of the form {key: {column_name: column_value}} if this is a standard column family or {key: {super_column_name: {column_name: column_value}}} if this is a super column family.
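
For example, for a standard column family:

    >>> cf.batch_insert({'key1': {'col1': 'val1'},
    ...                  'key2': {'col1': 'val2'}})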

add(key, column[, value][, super_column][, write_consistency_level])

Increment or decrement a counter.

value should be an integer, either positive or negative, to be added to a counter column. By default, value is 1.

New in version 1.1.0: Available in Cassandra 0.8.0 and later.
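
For example, with a hypothetical counter column family and 'page_views' column:

    >>> counter_cf.add('key1', 'page_views')       # increment by 1
    >>> counter_cf.add('key1', 'page_views', 10)   # increment by 10
    >>> counter_cf.add('key1', 'page_views', -3)   # decrement by 3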

remove(key[, columns][, super_column][, write_consistency_level])

Remove a specified row or a set of columns within the row with key key.

A set of columns or super columns to delete may be specified using columns.

A single super column may be deleted by setting super_column. If super_column is specified, columns will apply to the subcolumns of super_column.

If columns and super_column are both None, the entire row is removed.

The timestamp used for the mutation is returned.
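
For example:

    >>> cf.remove('key1', columns=['col1', 'col2'])  # remove two columns
    >>> cf.remove('key2')                            # remove the whole row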

remove_counter(key, column[, super_column][, write_consistency_level])

Remove a counter at the specified location.

Note that counters have limited support for deletes: if you remove a counter, you must wait to issue any following update until the delete has reached all the nodes and all of them have been fully compacted.

New in version 1.1.0: Available in Cassandra 0.8.0 and later.

truncate()

Marks the entire ColumnFamily as deleted.

From the user’s perspective, a successful call to truncate() results in the complete deletion of data from this column family. Internally, however, disk space will not be released immediately; as with all deletes in Cassandra, the data is merely marked as deleted.

The operation succeeds only if all hosts in the cluster are available; an UnavailableException will be raised if any hosts are down.

batch([queue_size][, write_consistency_level])

Create a batch mutator for doing multiple insert, update, and remove operations using as few roundtrips as possible.

The queue_size parameter sets the max number of mutations per request.

A CfMutator is returned.
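
For example, queuing several mutations and flushing whatever remains with send():

    >>> b = cf.batch(queue_size=25)
    >>> b.insert('key1', {'col1': 'val1'})
    >>> b.insert('key2', {'col1': 'val2'})
    >>> b.remove('key3', ['col1'])
    >>> b.send()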
