Data¶

class cornac.data.FeatureModule(features=None, ids=None, normalized=False, **kwargs)[source]¶

Parameters:

features (numpy.ndarray or scipy.sparse.csr_matrix, default = None) – 2D feature matrix whose row indices are aligned with the users/items in ids.
ids (List, default = None) – List of user/item ids whose indices are aligned with the rows of features. If None, the row indices of the provided features will be used as ids.

batch_feature(batch_ids)[source]¶: Return a matrix (batch of feature vectors) corresponding to provided batch_ids

build(id_map=None)[source]¶: Build the feature matrix. If id_map is provided, the feature rows are swapped so that they line up with the mapped ids.

feature_dim¶: Return the dimensionality of the feature vectors

features¶: Return the whole feature matrix
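
As a rough illustration of what aligning features with an id_map means, the sketch below reorders feature rows in plain Python so that the row index equals the mapped id. It does not call Cornac. The helper `align_features` and the sample ids are hypothetical, and the real build step may differ in detail (for example in how it treats ids that are missing from the map).

```python
from typing import Dict, List, Sequence


def align_features(features: Sequence[Sequence[float]],
                   ids: Sequence[str],
                   id_map: Dict[str, int]) -> List[List[float]]:
    """Return rows reordered so that row k belongs to the id mapped to k.

    Hypothetical helper: ids that are absent from id_map are dropped,
    which is an assumption of this sketch.
    """
    aligned: List[List[float]] = [[] for _ in range(len(id_map))]
    for row, raw_id in zip(features, ids):
        if raw_id in id_map:
            aligned[id_map[raw_id]] = list(row)
    return aligned


# Rows are given in the order of `ids` ...
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ids = ["item_a", "item_b", "item_c"]
# ... but the mapping assigns different integer positions.
id_map = {"item_c": 0, "item_a": 1, "item_b": 2}

aligned = align_features(features, ids, id_map)
feature_dim = len(aligned[0])  # dimensionality of each feature vector
```

After the call, `aligned[id_map[x]]` is the feature vector of raw id `x`, so a batch can be fetched by indexing with mapped ids.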

class cornac.data.TextModule(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]¶

Text module

Parameters:

corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
max_doc_freq (Union[float, int], default = 1.0) – The maximum document frequency allowed for a token; tokens appearing in more documents than this are excluded from the vocabulary. If float, the value represents a proportion of documents; if int, an absolute count. If vocab is not None, this will be ignored.
min_freq (int, default = 1) – The minimum frequency a token needs in order to be included in the vocabulary. If vocab is not None, this will be ignored.
stop_words (Collection or str, default = None) – Collection of stop words which will be ignored when building the Vocabulary. If str, it indicates a built-in stop words list. Currently, only English is supported.

batch_seq(batch_ids, max_length=None)[source]¶: Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.
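
The shape rule described for batch_seq can be sketched in plain Python. This is only an illustration: the helper `pad_sequences`, the padding value 0, and truncation on the right are assumptions of the sketch, not documented behavior.

```python
from typing import List, Optional


def pad_sequences(sequences: List[List[int]],
                  max_length: Optional[int] = None,
                  pad_value: int = 0) -> List[List[int]]:
    """Pad or clip token-id sequences into a rectangular batch."""
    if max_length is None:
        # Infer the width from the longest retrieved sequence.
        max_length = max((len(s) for s in sequences), default=0)
    batch = []
    for seq in sequences:
        clipped = seq[:max_length]
        batch.append(clipped + [pad_value] * (max_length - len(clipped)))
    return batch


token_ids = [[4, 7, 2], [9], [3, 3, 8, 1, 5]]
inferred = pad_sequences(token_ids)               # width inferred as 5
fixed = pad_sequences(token_ids, max_length=2)    # width forced to 2
```

Either way the result has `len(batch_ids)` rows and `max_length` columns, matching the size stated above.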

batch_tfidf(batch_ids)[source]¶: Return matrix of TF-IDF features corresponding to provided batch_ids

build(id_map=None)[source]¶: Build the model based on provided list of ordered ids

class cornac.data.ImageModule(**kwargs)[source]¶

Image module

batch_image(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]¶: Return batch of images corresponding to provided batch_ids

build(id_map=None)[source]¶: Build the model based on provided list of ordered ids

class cornac.data.GraphModule(**kwargs)[source]¶

Graph module

batch(batch_ids)[source]¶

Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.

Parameters:	batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.

build(id_map=None)[source]¶: Build the feature matrix. Features will be swapped if the id_map is provided

get_train_triplet(train_row_ids, train_col_ids)[source]¶: Get the training tuples

class cornac.data.TrainSet(uid_map, iid_map)[source]¶

Training Set

Parameters:

uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

get_iid(raw_iid)[source]¶: Return the mapped id of an item given a raw id

get_uid(raw_uid)[source]¶: Return the mapped id of a user given a raw id

static idx_iter(idx_range, batch_size=1, shuffle=False)[source]¶

Create an iterator over batch of indices

Parameters:

batch_size (int, optional, default = 1) – Number of indices per batch.
shuffle (bool, optional, default = False) – If True, the order of the indices will be randomized. If False, the default order is kept.

Returns:	iterator – yields batches of indices (array of np.int).
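
A minimal sketch of such a batch iterator, in plain Python rather than NumPy. The function name `batch_indices`, the `seed` argument, and the reading of idx_range as a number of indices are assumptions made for illustration.

```python
import random
from typing import Iterator, List


def batch_indices(idx_range: int, batch_size: int = 1,
                  shuffle: bool = False, seed: int = 0) -> Iterator[List[int]]:
    """Yield consecutive batches of indices drawn from range(idx_range)."""
    indices = list(range(idx_range))
    if shuffle:
        # A seeded generator keeps the sketch reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, idx_range, batch_size):
        yield indices[start:start + batch_size]


batches = list(batch_indices(5, batch_size=2))
shuffled = list(batch_indices(5, batch_size=2, shuffle=True))
```

The last batch may be shorter than batch_size when the range does not divide evenly.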

iid_list¶: Return the list of mapped item ids

is_unk_item(mapped_iid)[source]¶: Return whether or not an item is unknown given the mapped id

is_unk_user(mapped_uid)[source]¶: Return whether or not a user is unknown given the mapped id

num_items¶: Return the number of items

num_users¶: Return the number of users

raw_iid_list¶: Return the list of raw item ids

raw_uid_list¶: Return the list of raw user ids

uid_list¶: Return the list of mapped user ids

class cornac.data.MatrixTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map)[source]¶

Training set contains preference matrix

Parameters:

matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
max_rating (float) – Maximum value of the preferences.
min_rating (float) – Minimum value of the preferences.
global_mean (float) – Average value of the preferences.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

classmethod from_uir(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]¶

Constructing TrainSet from triplet data.

Parameters:

data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating).
global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
global_ui_set (set, optional, default: None) – The global set of tuples (user, item). This helps avoid duplicate observations.
verbose (bool, default: False) – The verbosity flag.
Returns:	train_set – MatrixTrainSet object.
Return type:	`<cornac.data.MatrixTrainSet>`
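
To make the roles of the id maps and the (user, item) set concrete, here is a hedged sketch of turning raw triplets into mapped triplets. It is not the library's code: `map_triplets` is a hypothetical helper, plain dicts stand in for the defaultdicts, and assigning new ids in order of first appearance is an assumption.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def map_triplets(data: Iterable[Tuple[Hashable, Hashable, float]],
                 uid_map: Dict[Hashable, int],
                 iid_map: Dict[Hashable, int],
                 ui_set: Set[Tuple[Hashable, Hashable]]
                 ) -> List[Tuple[int, int, float]]:
    """Map raw ids to integer ids, skipping duplicate (user, item) pairs."""
    mapped = []
    for raw_uid, raw_iid, rating in data:
        if (raw_uid, raw_iid) in ui_set:
            continue  # duplicate observation, keep the first one only
        ui_set.add((raw_uid, raw_iid))
        # setdefault returns the existing id or registers the next free one.
        uid = uid_map.setdefault(raw_uid, len(uid_map))
        iid = iid_map.setdefault(raw_iid, len(iid_map))
        mapped.append((uid, iid, float(rating)))
    return mapped


raw = [("u1", "i1", 4), ("u1", "i2", 5), ("u2", "i1", 3), ("u1", "i1", 1)]
uid_map, iid_map, ui_set = {}, {}, set()
mapped = map_triplets(raw, uid_map, iid_map, ui_set)

ratings = [r for _, _, r in mapped]
max_rating, min_rating = max(ratings), min(ratings)
global_mean = sum(ratings) / len(ratings)
```

The three summary values computed at the end correspond to the max_rating, min_rating, and global_mean constructor arguments listed above.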

item_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over item ids

Parameters:

batch_size (int, optional, default = 1) – Number of item ids per batch.
shuffle (bool, optional, default = False) – If True, the order of the item ids will be randomized. If False, the default order is kept.

Returns:	iterator – yields batches of item ids (array of np.int).

uij_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over data yielding batch of users, positive items, and negative items

Parameters:

batch_size (int, optional, default = 1) – Number of triplets per batch.
shuffle (bool, optional, default = False) – If True, the order of the triplets will be randomized. If False, the default order is kept.

Returns:	iterator – yields a batch of users (array of np.int), a batch of positive items (array of np.int), and a batch of negative items (array of np.int).
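
The idea of yielding (user, positive item, negative item) batches can be sketched as follows. This is an illustration under stated assumptions, not the library's sampler: `uij_batches` is hypothetical, and the negative item is drawn uniformly from the items the user has not interacted with.

```python
import random
from typing import Dict, Iterator, List, Set, Tuple


def uij_batches(positives: List[Tuple[int, int]], num_items: int,
                batch_size: int = 1, seed: int = 0
                ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield batches of users, observed items, and sampled unobserved items."""
    rng = random.Random(seed)
    seen: Dict[int, Set[int]] = {}
    for u, i in positives:
        seen.setdefault(u, set()).add(i)
    for start in range(0, len(positives), batch_size):
        users, pos_items, neg_items = [], [], []
        for u, i in positives[start:start + batch_size]:
            if len(seen[u]) >= num_items:
                raise ValueError("user has interacted with every item")
            # Rejection sampling: redraw until the item is unobserved for u.
            j = rng.randrange(num_items)
            while j in seen[u]:
                j = rng.randrange(num_items)
            users.append(u)
            pos_items.append(i)
            neg_items.append(j)
        yield users, pos_items, neg_items


positives = [(0, 0), (0, 1), (1, 2), (2, 0)]
triples = list(uij_batches(positives, num_items=4, batch_size=3))
```

Batches of this shape are what pairwise ranking objectives typically consume, with the positive item expected to score above the negative one.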

uir_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over data yielding batch of users, items, and rating values

Parameters:

batch_size (int, optional, default = 1) – Number of triplets per batch.
shuffle (bool, optional, default = False) – If True, the order of the triplets will be randomized. If False, the default order is kept.

Returns:	iterator – yields a batch of users (array of np.int), a batch of items (array of np.int), and a batch of ratings (array of np.float).

user_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over user ids

Parameters:

batch_size (int, optional, default = 1) – Number of user ids per batch.
shuffle (bool, optional, default = False) – If True, the order of the user ids will be randomized. If False, the default order is kept.

Returns:	iterator – yields batches of user ids (array of np.int).

class cornac.data.MultimodalTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]¶

Multimodal training set

Parameters:

matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
max_rating (float) – Maximum value of the preferences.
min_rating (float) – Minimum value of the preferences.
global_mean (float) – Average value of the preferences.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

class cornac.data.TestSet(user_ratings, uid_map, iid_map)[source]¶

Test Set

Parameters:

user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

classmethod from_uir(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]¶

Constructing TestSet from triplet data.

Parameters:

data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating).
global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
global_ui_set (set) – The global set of tuples (user, item). This helps avoid duplicate observations.
verbose (bool, default: False) – The verbosity flag.
Returns:	test_set – TestSet object.
Return type:	`<cornac.data.TestSet>`

get_iid(raw_iid)[source]¶: Return the mapped id of an item given a raw id

get_ratings(mapped_uid)[source]¶: Return a list of tuples of (item, rating) of given mapped user id
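
The user_ratings structure described for TestSet, a dictionary of (item, rating) lists keyed by user id, can be sketched in plain Python. The sample triplets are made up, and extending the global maps with ids that were not seen before is an assumption of this sketch.

```python
from collections import defaultdict

# Global maps as they might look after the training data was processed.
global_uid_map = {"u1": 0, "u2": 1}
global_iid_map = {"i1": 0, "i2": 1}

test_triplets = [("u1", "i2", 4.0), ("u2", "i1", 2.0),
                 ("u1", "i1", 5.0), ("u3", "i1", 1.0)]

user_ratings = defaultdict(list)
for raw_uid, raw_iid, rating in test_triplets:
    # Assumption: unseen raw ids receive the next free mapped id.
    uid = global_uid_map.setdefault(raw_uid, len(global_uid_map))
    iid = global_iid_map.setdefault(raw_iid, len(global_iid_map))
    user_ratings[uid].append((iid, rating))

users = list(user_ratings.keys())        # mapped user ids with test ratings
ratings_of_u1 = user_ratings[global_uid_map["u1"]]
```

Looking up a user's list in this dictionary is the kind of access that get_ratings provides for a mapped user id.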

get_uid(raw_uid)[source]¶: Return the mapped id of a user given a raw id

users¶: Return a list of users

class cornac.data.MultimodalTestSet(user_ratings, uid_map, iid_map, **kwargs)[source]¶

Test Set

Parameters:

user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

Train Set¶

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.trainset.MatrixTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map)[source]¶

Training set contains preference matrix

Parameters:

matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
max_rating (float) – Maximum value of the preferences.
min_rating (float) – Minimum value of the preferences.
global_mean (float) – Average value of the preferences.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

classmethod from_uir(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]¶

Constructing TrainSet from triplet data.

Parameters:

data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating).
global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
global_ui_set (set, optional, default: None) – The global set of tuples (user, item). This helps avoid duplicate observations.
verbose (bool, default: False) – The verbosity flag.
Returns:	train_set – MatrixTrainSet object.
Return type:	`<cornac.data.MatrixTrainSet>`

item_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over item ids

Parameters:

batch_size (int, optional, default = 1) – Number of item ids per batch.
shuffle (bool, optional, default = False) – If True, the order of the item ids will be randomized. If False, the default order is kept.

Returns:	iterator – yields batches of item ids (array of np.int).

uij_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over data yielding batch of users, positive items, and negative items

Parameters:

batch_size (int, optional, default = 1) – Number of triplets per batch.
shuffle (bool, optional, default = False) – If True, the order of the triplets will be randomized. If False, the default order is kept.

Returns:	iterator – yields a batch of users (array of np.int), a batch of positive items (array of np.int), and a batch of negative items (array of np.int).

uir_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over data yielding batch of users, items, and rating values

Parameters:

batch_size (int, optional, default = 1) – Number of triplets per batch.
shuffle (bool, optional, default = False) – If True, the order of the triplets will be randomized. If False, the default order is kept.

Returns:	iterator – yields a batch of users (array of np.int), a batch of items (array of np.int), and a batch of ratings (array of np.float).

user_iter(batch_size=1, shuffle=False)[source]¶

Create an iterator over user ids

Parameters:

batch_size (int, optional, default = 1) – Number of user ids per batch.
shuffle (bool, optional, default = False) – If True, the order of the user ids will be randomized. If False, the default order is kept.

Returns:	iterator – yields batches of user ids (array of np.int).

class cornac.data.trainset.MultimodalTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]¶

Multimodal training set

Parameters:

matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
max_rating (float) – Maximum value of the preferences.
min_rating (float) – Minimum value of the preferences.
global_mean (float) – Average value of the preferences.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

class cornac.data.trainset.TrainSet(uid_map, iid_map)[source]¶

Training Set

Parameters:

uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

get_iid(raw_iid)[source]¶: Return the mapped id of an item given a raw id

get_uid(raw_uid)[source]¶: Return the mapped id of a user given a raw id

static idx_iter(idx_range, batch_size=1, shuffle=False)[source]¶

Create an iterator over batch of indices

Parameters:

batch_size (int, optional, default = 1) – Number of indices per batch.
shuffle (bool, optional, default = False) – If True, the order of the indices will be randomized. If False, the default order is kept.

Returns:	iterator – yields batches of indices (array of np.int).

iid_list¶: Return the list of mapped item ids

is_unk_item(mapped_iid)[source]¶: Return whether or not an item is unknown given the mapped id

is_unk_user(mapped_uid)[source]¶: Return whether or not a user is unknown given the mapped id

num_items¶: Return the number of items

num_users¶: Return the number of users

raw_iid_list¶: Return the list of raw item ids

raw_uid_list¶: Return the list of raw user ids

uid_list¶: Return the list of mapped user ids

Test Set¶

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.testset.MultimodalTestSet(user_ratings, uid_map, iid_map, **kwargs)[source]¶

Test Set

Parameters:

user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

class cornac.data.testset.TestSet(user_ratings, uid_map, iid_map)[source]¶

Test Set

Parameters:

user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

classmethod from_uir(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]¶

Constructing TestSet from triplet data.

Parameters:

data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating).
global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
global_ui_set (set) – The global set of tuples (user, item). This helps avoid duplicate observations.
verbose (bool, default: False) – The verbosity flag.
Returns:	test_set – TestSet object.
Return type:	`<cornac.data.TestSet>`

get_iid(raw_iid)[source]¶: Return the mapped id of an item given a raw id

get_ratings(mapped_uid)[source]¶: Return a list of tuples of (item, rating) of given mapped user id

get_uid(raw_uid)[source]¶: Return the mapped id of a user given a raw id

users¶: Return a list of users

Graph Module¶

@author: Aghiles Salah <asalah@smu.edu.sg>

class cornac.data.graph.GraphModule(**kwargs)[source]¶

Graph module

batch(batch_ids)[source]¶

Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.

Parameters:	batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.

build(id_map=None)[source]¶: Build the feature matrix. Features will be swapped if the id_map is provided

get_train_triplet(train_row_ids, train_col_ids)[source]¶: Get the training tuples

Text Module¶

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.text.Tokenizer[source]¶

Generic class for other subclasses to extend from. This typically either splits text into word tokens or character tokens.

batch_tokenize(texts: List[str]) → List[List[str]][source]¶

Splitting a corpus with multiple text documents.

Returns:	tokens
Return type:	`List[List[str]]`

tokenize(t: str) → List[str][source]¶

Splitting text into tokens.

Returns:	tokens
Return type:	`List[str]`

class cornac.data.text.BaseTokenizer(sep: str = ' ', pre_rules: List[Callable[str, str]] = None, stop_words: Union[List, str] = None)[source]¶

A base tokenizer that uses the provided delimiter sep to split text.

batch_tokenize(texts: List[str]) → List[List[str]][source]¶

Splitting a corpus with multiple text documents.

Returns:	tokens
Return type:	`List[List[str]]`

tokenize(t: str) → List[str][source]¶

Splitting text into tokens.

Returns:	tokens
Return type:	`List[str]`
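
A small sketch of delimiter-based tokenization with pre-processing rules, written in plain Python. The functions `lower_case`, `strip_punctuation`, and `simple_tokenize` are hypothetical, and the order of operations (rules first, then split, then stop-word removal) is an assumption, not a statement about BaseTokenizer internals.

```python
import re
from typing import Callable, List, Optional


def lower_case(t: str) -> str:
    return t.lower()


def strip_punctuation(t: str) -> str:
    # Replace anything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", t)


def simple_tokenize(t: str, sep: str = " ",
                    pre_rules: Optional[List[Callable[[str], str]]] = None,
                    stop_words: Optional[List[str]] = None) -> List[str]:
    """Apply each rule to the text, split on sep, drop empties and stop words."""
    for rule in pre_rules or []:
        t = rule(t)
    stop = set(stop_words or [])
    return [tok for tok in t.split(sep) if tok and tok not in stop]


rules = [lower_case, strip_punctuation]
tokens = simple_tokenize("Great phone, great price!",
                         pre_rules=rules, stop_words=["great"])
batch = [simple_tokenize(t, pre_rules=rules) for t in ["Hello, world", "Bye"]]
```

The list comprehension at the end mirrors the relationship between tokenize (one text) and batch_tokenize (a list of texts).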

class cornac.data.text.Vocabulary(idx2tok: List[str], use_special_tokens: bool = False)[source]¶

Vocabulary basically contains mapping between numbers and tokens and vice versa.

classmethod from_sequences(sequences: List[List[str]], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]¶: Build a vocabulary from sequences (list of list of tokens).

classmethod from_tokens(tokens: List[str], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]¶: Build a vocabulary from list of tokens.

classmethod load(path)[source]¶: Load a vocabulary from path to a pickle file.

save(path)[source]¶: Save idx2tok into a pickle file.

to_idx(tokens: List[str]) → List[int][source]¶: Convert a list of tokens to their integer indices.

to_text(indices: List[int], sep=' ') → List[str][source]¶: Convert a list of integer indices to their tokens.
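
The two-way mapping that a vocabulary holds can be sketched with a list and a dictionary. `build_idx2tok` is a hypothetical helper, and ordering tokens by descending frequency (ties broken alphabetically) is an assumption made so the result is deterministic.

```python
from collections import Counter
from typing import List, Optional


def build_idx2tok(tokens: List[str], max_vocab: Optional[int] = None,
                  min_freq: int = 1) -> List[str]:
    """Keep tokens seen at least min_freq times, most frequent first."""
    counts = Counter(tokens)
    kept = [(tok, c) for tok, c in counts.items() if c >= min_freq]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    # Slicing with None keeps everything when no size limit is given.
    return [tok for tok, _ in kept[:max_vocab]]


tokens = ["good", "bad", "good", "fine", "good", "bad"]
idx2tok = build_idx2tok(tokens, min_freq=2)
tok2idx = {tok: idx for idx, tok in enumerate(idx2tok)}

as_indices = [tok2idx[t] for t in ["bad", "good"]]   # tokens -> indices
as_tokens = [idx2tok[i] for i in as_indices]         # indices -> tokens
```

Converting to indices and back returns the original tokens as long as every token is in the vocabulary; handling of unknown tokens is left out of this sketch.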

class cornac.data.text.CountVectorizer(tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, max_features: int = None, stop_words: Union[List, str] = None, binary: bool = False)[source]¶

Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

Parameters:

tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
max_doc_freq (Union[float, int], default = 1.0) – The maximum document frequency allowed for a token; tokens appearing in more documents than this are excluded from the vocabulary. If float, the value represents a proportion of documents; if int, an absolute count. If vocab is not None, this will be ignored.
min_freq (int, default = 1) – The minimum frequency a token needs in order to be included in the vocabulary. If vocab is not None, this will be ignored.
max_features (int, default = None) – If not None, build a vocabulary that only considers the top max_features tokens ordered by term frequency across the corpus. If vocab is not None, this will be ignored.
stop_words (Collection or str, default = None) – Collection of stop words which will be ignored when building the Vocabulary. If str, it indicates a built-in stop words list. Currently, only English is supported.
binary (boolean, default = False) – If True, all non-zero counts are set to 1.

fit(raw_documents: List[str]) → cornac.data.text.CountVectorizer[source]¶

Build a vocabulary of all tokens in the raw documents.

Parameters:	raw_documents (iterable) – An iterable which yields either str, unicode or file objects.
Returns:	self

fit_transform(raw_documents: List[str]) → (List[List[str]], scipy.sparse.csr_matrix)[source]¶

Build the vocabulary and return the tokenized sequences together with the document-term matrix.

Parameters:	raw_documents (List[str]) – The raw text documents.
Returns:	(sequences, X) – sequences (List[List[str]]): tokenized sequences of raw_documents. X (array, [n_samples, n_features]): document-term matrix.

transform(raw_documents: List[str]) → (List[List[str]], scipy.sparse.csr_matrix)[source]¶

Transform documents to a document-term matrix.

Parameters:	raw_documents (List[str]) – The raw text documents.
Returns:	(sequences, X) – sequences (List[List[str]]): tokenized sequences of raw_documents. X (array, [n_samples, n_features]): document-term matrix.
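
How the frequency filters and the binary flag shape a document-term matrix can be sketched without any dependencies. Everything here is an assumption for illustration: `count_matrix` is hypothetical, whitespace splitting replaces the tokenizer, min_freq is read as a corpus-wide term frequency, and a dense list of lists stands in for the sparse matrix.

```python
from collections import Counter
from typing import List, Tuple, Union


def count_matrix(docs: List[str], max_doc_freq: Union[float, int] = 1.0,
                 min_freq: int = 1, binary: bool = False
                 ) -> Tuple[List[List[str]], List[str], List[List[int]]]:
    """Return tokenized sequences, the vocabulary, and per-document counts."""
    sequences = [d.split() for d in docs]
    term_freq = Counter(t for s in sequences for t in s)
    doc_freq = Counter(t for s in sequences for t in set(s))
    # A float threshold is a proportion of documents, an int is a count.
    limit = (max_doc_freq * len(docs) if isinstance(max_doc_freq, float)
             else max_doc_freq)
    vocab = sorted(t for t in term_freq
                   if term_freq[t] >= min_freq and doc_freq[t] <= limit)
    col = {t: j for j, t in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for i, seq in enumerate(sequences):
        for t in seq:
            if t in col:
                X[i][col[t]] = 1 if binary else X[i][col[t]] + 1
    return sequences, vocab, X


docs = ["the cat sat", "the dog sat down", "the cat ran"]
sequences, vocab, X = count_matrix(docs, max_doc_freq=0.9, min_freq=2)
```

In this toy corpus "the" occurs in all three documents and exceeds the 0.9 proportion, while "dog", "down", and "ran" fall below min_freq, so only two columns remain.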

class cornac.data.text.TextModule(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]¶

Text module

Parameters:

corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
max_doc_freq (Union[float, int], default = 1.0) – The maximum document frequency allowed for a token; tokens appearing in more documents than this are excluded from the vocabulary. If float, the value represents a proportion of documents; if int, an absolute count. If vocab is not None, this will be ignored.
min_freq (int, default = 1) – The minimum frequency a token needs in order to be included in the vocabulary. If vocab is not None, this will be ignored.
stop_words (Collection or str, default = None) – Collection of stop words which will be ignored when building the Vocabulary. If str, it indicates a built-in stop words list. Currently, only English is supported.

batch_seq(batch_ids, max_length=None)[source]¶: Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.

batch_tfidf(batch_ids)[source]¶: Return matrix of TF-IDF features corresponding to provided batch_ids

build(id_map=None)[source]¶: Build the model based on provided list of ordered ids

Image Module¶

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.image.ImageModule(**kwargs)[source]¶

Image module

batch_image(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]¶: Return batch of images corresponding to provided batch_ids

build(id_map=None)[source]¶: Build the model based on provided list of ordered ids

Reader¶

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

cornac.data.reader.read_ui(fpath, value=1.0, sep='\t', skip_lines=0)[source]¶

Read implicit feedback data in user-items form. Each line starts with a user id followed by one or more item ids.

Parameters:

fpath (str) – Path to the data file.
value (float, default: 1.0) – Value for the feedback.
sep (str, default: '\t') – The delimiter string.
skip_lines (int, default: 0) – Number of first lines to skip.

Returns:	triplets – Data in the form of list of tuples of (user, item, value).
Return type:	`iterable`
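
The user-items line format can be illustrated with a short parser over an in-memory file. This is a sketch, not the reader's source: `parse_ui` is a hypothetical function, and the sample lines are made up.

```python
import io
from typing import Iterable, List, Tuple


def parse_ui(lines: Iterable[str], value: float = 1.0, sep: str = "\t",
             skip_lines: int = 0) -> List[Tuple[str, str, float]]:
    """Expand 'user<sep>item<sep>item...' lines into (user, item, value)."""
    triplets = []
    for n, line in enumerate(lines):
        if n < skip_lines:
            continue  # e.g. a header row
        tokens = [t for t in line.strip().split(sep) if t]
        if not tokens:
            continue  # blank line
        user, items = tokens[0], tokens[1:]
        triplets.extend((user, item, value) for item in items)
    return triplets


sample = io.StringIO("user\titems\nu1\ti1\ti2\nu2\ti3\n")
ui_triplets = parse_ui(sample, skip_lines=1)
```

Each user line fans out into one triplet per item, all carrying the same feedback value.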

cornac.data.reader.read_uir(fpath, u_col=0, i_col=1, r_col=2, sep='\t', skip_lines=0)[source]¶

Read data in the form of triplets (user, item, rating).

Parameters:

fpath (str) – Path to the data file.
u_col (int, default: 0) – Index of the user column.
i_col (int, default: 1) – Index of the item column.
r_col (int, default: 2) – Index of the rating column.
sep (str, default: '\t') – The delimiter string.
skip_lines (int, default: 0) – Number of first lines to skip.

Returns:	triplets – Data in the form of list of tuples of (user, item, rating).
Return type:	`iterable`
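
The column-index parameters can be illustrated with a short parser over an in-memory file. `parse_uir` is a hypothetical stand-in written for this example, the sample rows are made up, and converting the rating to float is an assumption.

```python
import io
from typing import Iterable, List, Tuple


def parse_uir(lines: Iterable[str], u_col: int = 0, i_col: int = 1,
              r_col: int = 2, sep: str = "\t",
              skip_lines: int = 0) -> List[Tuple[str, str, float]]:
    """Pick the user, item, and rating columns out of delimited lines."""
    triplets = []
    for n, line in enumerate(lines):
        if n < skip_lines:
            continue
        tokens = line.rstrip("\n").split(sep)
        if len(tokens) <= max(u_col, i_col, r_col):
            continue  # line too short to hold all three columns
        triplets.append((tokens[u_col], tokens[i_col], float(tokens[r_col])))
    return triplets


# Here the item comes first and the user second, so the indices are swapped.
sample = io.StringIO("i1,u1,4.0\ni2,u1,3.5\ni1,u2,5\n")
uir_triplets = parse_uir(sample, u_col=1, i_col=0, r_col=2, sep=",")
```

Whatever the column order in the file, the output tuples always follow the (user, item, rating) order.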