Data

class cornac.data.FeatureModule(features=None, ids=None, normalized=False, **kwargs)[source]
Parameters:
  • features (numpy.ndarray or scipy.sparse.csr_matrix, default = None) – 2D feature matrix whose row indices are aligned with the user/item ids in ids.
  • ids (List, default = None) – List of user/item ids whose indices are aligned with the rows of features. If None, the row indices of the provided features will be used as ids.
batch_feature(batch_ids)[source]

Return a matrix (batch of feature vectors) corresponding to provided batch_ids

build(id_map=None)[source]

Build the feature matrix. If id_map is provided, the feature rows are swapped to follow it.

feature_dim

Return the dimensionality of the feature vectors

features

Return the whole feature matrix
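
As a rough illustration of the build and batch_feature behaviour described above, the sketch below moves each feature row to the position given by an id map, then looks up a batch by mapped id. This is one plausible reading of "swapped", not Cornac's implementation: reorder_features and batch_rows are hypothetical helpers, and plain lists stand in for a NumPy array.

```python
def reorder_features(features, ids, id_map):
    """Place each row at the position given by id_map[raw_id].

    Assumes id_map sends every raw id in `ids` to a unique index
    in range(len(ids)); ids missing from id_map are skipped.
    """
    reordered = [None] * len(features)
    for row, raw_id in zip(features, ids):
        mapped = id_map.get(raw_id)
        if mapped is not None:
            reordered[mapped] = row
    return reordered


def batch_rows(features, batch_ids):
    """Return the feature vectors for the given mapped ids."""
    return [features[i] for i in batch_ids]


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ids = ["a", "b", "c"]
id_map = {"c": 0, "a": 1, "b": 2}  # raw id -> mapped id

aligned = reorder_features(features, ids, id_map)
batch = batch_rows(aligned, [0, 2])
```

After reordering, row 0 of aligned holds the features of "c", so a model can index features directly with mapped ids.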

class cornac.data.TextModule(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]

Text module

Parameters:
  • corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
  • ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
  • tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
  • vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
  • max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
  • max_doc_freq (Union[float, int], default = 1.0) – Tokens whose document frequency exceeds this threshold are excluded from the vocabulary. If float, the value is a proportion of documents; if int, an absolute document count. If vocab is not None, this will be ignored.
  • min_freq (int, default = 1) – The minimum frequency of tokens to be included into vocabulary. If vocab is not None, this will be ignored.
  • stop_words (Collection, str, default: None) – Collection of stop words which will be ignored when building Vocabulary. If str, it indicates a built-in stop words list. Currently, only english is supported.
batch_seq(batch_ids, max_length=None)[source]

Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.
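
The shape contract of batch_seq can be pictured with the following sketch, which pads or truncates token-id sequences to a common length. The padding value 0, right-side padding, and the helper name pad_sequences are assumptions made for illustration and may differ from what the library does.

```python
def pad_sequences(sequences, max_length=None, pad_value=0):
    """Return a rectangular list of lists of shape (len(sequences), max_length)."""
    if max_length is None:
        # Infer the width from the longest retrieved sequence.
        max_length = max((len(s) for s in sequences), default=0)
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_length])
        padded.append(clipped + [pad_value] * (max_length - len(clipped)))
    return padded


token_ids = [[4, 7, 2], [5], [9, 1]]
inferred = pad_sequences(token_ids)             # width inferred as 3
fixed = pad_sequences(token_ids, max_length=2)  # longer sequences are cut
```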

batch_tfidf(batch_ids)[source]

Return matrix of TF-IDF features corresponding to provided batch_ids

build(id_map=None)[source]

Build the model based on provided list of ordered ids

class cornac.data.ImageModule(**kwargs)[source]

Image module

batch_image(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]

Return batch of images corresponding to provided batch_ids

build(id_map=None)[source]

Build the model based on provided list of ordered ids

class cornac.data.GraphModule(**kwargs)[source]

Graph module

batch(batch_ids)[source]

Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.

Parameters:batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.
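
To make "rows of the adjacency matrix" concrete, here is a small sketch that builds a dense adjacency matrix from an edge list and selects a batch of rows. The edge list, the dense representation, and the function names are illustrative assumptions; the module itself works with a sparse matrix.

```python
def adjacency_matrix(edges, num_nodes):
    """Build a dense 0/1 adjacency matrix from directed (src, dst) edges."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def select_rows(matrix, batch_ids):
    """Select the adjacency rows for the given node ids."""
    return [matrix[i] for i in batch_ids]


edges = [(0, 1), (0, 2), (2, 1)]
adj = adjacency_matrix(edges, num_nodes=3)
rows = select_rows(adj, [0, 2])
```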
build(id_map=None)[source]

Build the feature matrix. Features will be swapped if the id_map is provided

get_train_triplet(train_row_ids, train_col_ids)[source]

Get the training tuples

class cornac.data.TrainSet(uid_map, iid_map)[source]

Training Set

Parameters:
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

static idx_iter(idx_range, batch_size=1, shuffle=False)[source]

Create an iterator over batch of indices

Parameters:
  • batch_size (int, optional, default = 1) – Number of indices in each batch.
  • shuffle (bool, optional) – If True, the order of the indices is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of indices (array of np.int)
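
A minimal sketch of such a batching iterator is shown below. It treats idx_range as an integer count and yields plain lists rather than NumPy arrays; both choices, along with the seed parameter, are assumptions made so that the example runs with the standard library only.

```python
import random


def idx_iter(idx_range, batch_size=1, shuffle=False, seed=None):
    """Yield lists of indices from range(idx_range), batch_size at a time."""
    indices = list(range(idx_range))
    if shuffle:
        # `seed` is a hypothetical extra argument for reproducibility.
        random.Random(seed).shuffle(indices)
    for start in range(0, idx_range, batch_size):
        yield indices[start:start + batch_size]


batches = list(idx_iter(5, batch_size=2))
shuffled = list(idx_iter(5, batch_size=2, shuffle=True, seed=0))
```

The last batch is shorter when idx_range is not a multiple of batch_size.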

iid_list

Return the list of mapped item ids

is_unk_item(mapped_iid)[source]

Return whether or not an item is unknown given the mapped id

is_unk_user(mapped_uid)[source]

Return whether or not a user is unknown given the mapped id

num_items

Return the number of items

num_users

Return the number of users

raw_iid_list

Return the list of raw item ids

raw_uid_list

Return the list of raw user ids

uid_list

Return the list of mapped user ids

class cornac.data.MatrixTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map)[source]

Training set containing a preference matrix

Parameters:
  • matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]

Construct a MatrixTrainSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set, optional, default: None) – The global set of tuples (user, item). This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

train_set – MatrixTrainSet object.

Return type:

<cornac.data.MatrixTrainSet>
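
The id mapping and duplicate filtering that from_uir describes can be sketched as follows. This is a simplified illustration, not the library code: build_id_maps is a hypothetical helper, plain dicts replace the defaultdicts, and keeping the first of two duplicate observations is an assumption.

```python
def build_id_maps(triplets):
    """Map raw user/item ids to consecutive integers, skipping repeated (user, item) pairs."""
    uid_map, iid_map = {}, {}
    ui_set = set()
    mapped = []
    for raw_uid, raw_iid, rating in triplets:
        if (raw_uid, raw_iid) in ui_set:
            continue  # duplicate observation: keep the first one only
        ui_set.add((raw_uid, raw_iid))
        uid = uid_map.setdefault(raw_uid, len(uid_map))
        iid = iid_map.setdefault(raw_iid, len(iid_map))
        mapped.append((uid, iid, float(rating)))
    return mapped, uid_map, iid_map


data = [("u1", "i1", 4), ("u1", "i2", 2), ("u2", "i1", 5), ("u1", "i1", 1)]
mapped, uid_map, iid_map = build_id_maps(data)

# Statistics of the kind passed to the constructor.
ratings = [r for _, _, r in mapped]
max_rating, min_rating = max(ratings), min(ratings)
global_mean = sum(ratings) / len(ratings)
```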

item_iter(batch_size=1, shuffle=False)[source]

Create an iterator over item ids

Parameters:
  • batch_size (int, optional, default = 1) – Number of item ids in each batch.
  • shuffle (bool, optional) – If True, the order of the item ids is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of item ids (array of np.int)

uij_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, positive items, and negative items

Parameters:
  • batch_size (int, optional, default = 1) – Number of triplets in each batch.
  • shuffle (bool, optional) – If True, the order of the triplets is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)
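
The following sketch shows one way an iterator of (user, positive item, negative item) batches could be produced. Drawing the negative item uniformly by rejection sampling is an assumption, as are the function name uij_batches, the dict-of-sets input, and the seed parameter; the library may sample differently.

```python
import random


def uij_batches(user_items, num_items, batch_size=1, seed=None):
    """Yield (users, pos_items, neg_items) batches.

    For every observed (user, item) pair, a negative item is drawn
    uniformly from the items the user has not interacted with.
    Assumes every user has at least one unobserved item.
    """
    rng = random.Random(seed)
    pairs = [(u, i) for u, items in user_items.items() for i in sorted(items)]
    for start in range(0, len(pairs), batch_size):
        users, pos_items, neg_items = [], [], []
        for u, i in pairs[start:start + batch_size]:
            j = rng.randrange(num_items)
            while j in user_items[u]:  # reject items the user already has
                j = rng.randrange(num_items)
            users.append(u)
            pos_items.append(i)
            neg_items.append(j)
        yield users, pos_items, neg_items


user_items = {0: {0, 1}, 1: {2}}  # mapped user id -> set of observed item ids
batches = list(uij_batches(user_items, num_items=4, batch_size=2, seed=0))
```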

uir_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, items, and rating values

Parameters:
  • batch_size (int, optional, default = 1) – Number of triplets in each batch.
  • shuffle (bool, optional) – If True, the order of the triplets is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)

user_iter(batch_size=1, shuffle=False)[source]

Create an iterator over user ids

Parameters:
  • batch_size (int, optional, default = 1) – Number of user ids in each batch.
  • shuffle (bool, optional) – If True, the order of the user ids is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of user ids (array of np.int)

class cornac.data.MultimodalTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]

Multimodal training set

Parameters:
  • matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.TestSet(user_ratings, uid_map, iid_map)[source]

Test Set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]

Construct a TestSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set) – The global set of tuples (user, item). This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

test_set – TestSet object.

Return type:

<cornac.data.TestSet>
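
As an illustration of how test triplets might be grouped into the user_ratings structure, consider the sketch below. It reuses global maps built from training data, skips (user, item) pairs already seen, and assigns fresh mapped ids to unseen users or items. That last behaviour and the helper name group_test_ratings are assumptions for the sake of the example.

```python
from collections import defaultdict


def group_test_ratings(triplets, uid_map, iid_map, ui_set):
    """Group (user, item, rating) triplets into {mapped_uid: [(mapped_iid, rating), ...]}."""
    user_ratings = defaultdict(list)
    for raw_uid, raw_iid, rating in triplets:
        if (raw_uid, raw_iid) in ui_set:
            continue  # already observed, e.g. in the training data
        ui_set.add((raw_uid, raw_iid))
        uid = uid_map.setdefault(raw_uid, len(uid_map))
        iid = iid_map.setdefault(raw_iid, len(iid_map))
        user_ratings[uid].append((iid, float(rating)))
    return user_ratings


# Global state as it might look after building a training set.
uid_map = {"u1": 0}
iid_map = {"i1": 0, "i2": 1}
ui_set = {("u1", "i1")}

test_data = [("u1", "i1", 3), ("u1", "i2", 4), ("u2", "i1", 5)]
user_ratings = group_test_ratings(test_data, uid_map, iid_map, ui_set)
```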

get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_ratings(mapped_uid)[source]

Return a list of tuples of (item, rating) of given mapped user id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

users

Return a list of users

class cornac.data.MultimodalTestSet(user_ratings, uid_map, iid_map, **kwargs)[source]

Multimodal test set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.

Train Set

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.trainset.MatrixTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map)[source]

Training set containing a preference matrix

Parameters:
  • matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]

Construct a MatrixTrainSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set, optional, default: None) – The global set of tuples (user, item). This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

train_set – MatrixTrainSet object.

Return type:

<cornac.data.MatrixTrainSet>

item_iter(batch_size=1, shuffle=False)[source]

Create an iterator over item ids

Parameters:
  • batch_size (int, optional, default = 1) – Number of item ids in each batch.
  • shuffle (bool, optional) – If True, the order of the item ids is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of item ids (array of np.int)

uij_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, positive items, and negative items

Parameters:
  • batch_size (int, optional, default = 1) – Number of triplets in each batch.
  • shuffle (bool, optional) – If True, the order of the triplets is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)

uir_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, items, and rating values

Parameters:
  • batch_size (int, optional, default = 1) – Number of triplets in each batch.
  • shuffle (bool, optional) – If True, the order of the triplets is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)

user_iter(batch_size=1, shuffle=False)[source]

Create an iterator over user ids

Parameters:
  • batch_size (int, optional, default = 1) – Number of user ids in each batch.
  • shuffle (bool, optional) – If True, the order of the user ids is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of user ids (array of np.int)

class cornac.data.trainset.MultimodalTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]

Multimodal training set

Parameters:
  • matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.trainset.TrainSet(uid_map, iid_map)[source]

Training Set

Parameters:
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

static idx_iter(idx_range, batch_size=1, shuffle=False)[source]

Create an iterator over batch of indices

Parameters:
  • batch_size (int, optional, default = 1) – Number of indices in each batch.
  • shuffle (bool, optional) – If True, the order of the indices is randomized. If False, the default order is kept.
Returns:

iterator

Return type:

batch of indices (array of np.int)

iid_list

Return the list of mapped item ids

is_unk_item(mapped_iid)[source]

Return whether or not an item is unknown given the mapped id

is_unk_user(mapped_uid)[source]

Return whether or not a user is unknown given the mapped id

num_items

Return the number of items

num_users

Return the number of users

raw_iid_list

Return the list of raw item ids

raw_uid_list

Return the list of raw user ids

uid_list

Return the list of mapped user ids

Test Set

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.testset.MultimodalTestSet(user_ratings, uid_map, iid_map, **kwargs)[source]

Multimodal test set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.testset.TestSet(user_ratings, uid_map, iid_map)[source]

Test Set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]

Construct a TestSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set) – The global set of tuples (user, item). This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

test_set – TestSet object.

Return type:

<cornac.data.TestSet>

get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_ratings(mapped_uid)[source]

Return a list of tuples of (item, rating) of given mapped user id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

users

Return a list of users

Graph Module

@author: Aghiles Salah <asalah@smu.edu.sg>

class cornac.data.graph.GraphModule(**kwargs)[source]

Graph module

batch(batch_ids)[source]

Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.

Parameters:batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.
build(id_map=None)[source]

Build the feature matrix. Features will be swapped if the id_map is provided

get_train_triplet(train_row_ids, train_col_ids)[source]

Get the training tuples

Text Module

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.text.Tokenizer[source]

Generic class for other subclasses to extend from. This typically either splits text into word tokens or character tokens.

batch_tokenize(texts: List[str]) → List[List[str]][source]

Splitting a corpus with multiple text documents.

Returns:tokens
Return type:List[List[str]]
tokenize(t: str) → List[str][source]

Splitting text into tokens.

Returns:tokens
Return type:List[str]
class cornac.data.text.BaseTokenizer(sep: str = ' ', pre_rules: List[Callable[[str], str]] = None, stop_words: Union[List, str] = None)[source]

A base tokenizer that uses the provided delimiter sep to split text.

batch_tokenize(texts: List[str]) → List[List[str]][source]

Splitting a corpus with multiple text documents.

Returns:tokens
Return type:List[List[str]]
tokenize(t: str) → List[str][source]

Splitting text into tokens.

Returns:tokens
Return type:List[str]
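
The behaviour of a delimiter-based tokenizer with pre-processing rules and stop words can be sketched in a few lines. The function below is an illustrative stand-in, not the BaseTokenizer implementation; in particular, dropping empty tokens and applying the rules before splitting are assumptions.

```python
def tokenize(text, sep=" ", pre_rules=None, stop_words=None):
    """Apply each pre-rule to the text, split on `sep`, and drop stop words."""
    for rule in pre_rules or []:
        text = rule(text)
    stop = set(stop_words or [])
    # Empty strings appear when the delimiter is repeated; skip them.
    return [tok for tok in text.split(sep) if tok and tok not in stop]


def batch_tokenize(texts, **kwargs):
    """Tokenize every document of a corpus."""
    return [tokenize(t, **kwargs) for t in texts]


tokens = tokenize("The quick  brown fox", pre_rules=[str.lower], stop_words=["the"])
corpus_tokens = batch_tokenize(["a b", "c"])
```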
class cornac.data.text.Vocabulary(idx2tok: List[str], use_special_tokens: bool = False)[source]

A vocabulary holds the mapping from integer indices to tokens and vice versa.

classmethod from_sequences(sequences: List[List[str]], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]

Build a vocabulary from sequences (list of list of tokens).

classmethod from_tokens(tokens: List[str], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]

Build a vocabulary from list of tokens.

classmethod load(path)[source]

Load a vocabulary from path to a pickle file.

save(path)[source]

Save idx2tok into a pickle file.

to_idx(tokens: List[str]) → List[int][source]

Convert a list of tokens to their integer indices.

to_text(indices: List[int], sep=' ') → List[str][source]

Convert a list of integer indices to their tokens.
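
A toy version of such a vocabulary is sketched below to show how from_tokens, to_idx and to_text relate. TinyVocabulary is a hypothetical class: it ignores special tokens, raises KeyError on unknown tokens, and orders tokens by frequency, all of which are simplifying assumptions.

```python
from collections import Counter


class TinyVocabulary:
    """Two-way mapping between tokens and integer indices."""

    def __init__(self, idx2tok):
        self.idx2tok = list(idx2tok)
        self.tok2idx = {tok: idx for idx, tok in enumerate(self.idx2tok)}

    @classmethod
    def from_tokens(cls, tokens, max_vocab=None, min_freq=1):
        # Keep the most frequent tokens, then apply the frequency floor.
        counts = Counter(tokens)
        kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
        return cls(kept)

    def to_idx(self, tokens):
        return [self.tok2idx[tok] for tok in tokens]

    def to_text(self, indices):
        return [self.idx2tok[idx] for idx in indices]


vocab = TinyVocabulary.from_tokens(["b", "a", "b", "c", "a", "b"], min_freq=2)
indices = vocab.to_idx(["a", "b"])
round_trip = vocab.to_text(indices)
```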

class cornac.data.text.CountVectorizer(tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, max_features: int = None, stop_words: Union[List, str] = None, binary: bool = False)[source]

Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

Parameters:
  • tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
  • vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
  • max_doc_freq (Union[float, int], default = 1.0) – Tokens whose document frequency exceeds this threshold are excluded from the vocabulary. If float, the value is a proportion of documents; if int, an absolute document count. If vocab is not None, this will be ignored.
  • min_freq (int, default = 1) – The minimum frequency of tokens to be included into vocabulary. If vocab is not None, this will be ignored.
  • max_features (int, default=None) – If not None, build a vocabulary that only considers the top max_features tokens ordered by term frequency across the corpus. If vocab is not None, this will be ignored.
  • stop_words (Collection, str, default: None) – Collection of stop words which will be ignored when building Vocabulary. If str, it indicates a built-in stop words list. Currently, only english is supported.
  • binary (boolean, default=False) – If True, all non zero counts are set to 1.
fit(raw_documents: List[str]) → cornac.data.text.CountVectorizer[source]

Build a vocabulary of all tokens in the raw documents.

Parameters:raw_documents (iterable) – An iterable which yields either str, unicode or file objects.
Returns:self
Return type:cornac.data.text.CountVectorizer
fit_transform(raw_documents: List[str]) → Tuple[List[List[str]], scipy.sparse.csr_matrix][source]

Build the vocabulary and return the document-term matrix.

Parameters:raw_documents (List[str]) – The raw text documents.
Returns:
sequences: List[List[str]]
Tokenized sequences of raw_documents.
X: sparse matrix, [n_samples, n_features]
Document-term matrix.
Return type:(sequences, X)
transform(raw_documents: List[str]) → Tuple[List[List[str]], scipy.sparse.csr_matrix][source]

Transform documents to a document-term matrix.

Parameters:raw_documents (List[str]) – The raw text documents.
Returns:
sequences: List[List[str]]
Tokenized sequences of raw_documents.
X: sparse matrix, [n_samples, n_features]
Document-term matrix.
Return type:(sequences, X)
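
What a document-term count matrix looks like can be seen in the small sketch below. It returns a dense list of lists instead of a scipy.sparse.csr_matrix and uses a hypothetical count_matrix helper with an alphabetically sorted vocabulary, so it only mirrors the idea, not the class itself.

```python
from collections import Counter


def count_matrix(sequences, vocab, binary=False):
    """Dense document-term counts: one row per document, one column per vocab token."""
    column = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for seq in sequences:
        counts = Counter(tok for tok in seq if tok in column)
        row = [0] * len(vocab)
        for tok, c in counts.items():
            row[column[tok]] = 1 if binary else c
        rows.append(row)
    return rows


docs = ["red red blue", "blue green"]
sequences = [d.split(" ") for d in docs]
vocab = sorted({tok for seq in sequences for tok in seq})

X = count_matrix(sequences, vocab)
X_binary = count_matrix(sequences, vocab, binary=True)
```

With binary=True every non-zero count collapses to 1, matching the description of the binary parameter.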
class cornac.data.text.TextModule(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]

Text module

Parameters:
  • corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
  • ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
  • tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
  • vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
  • max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
  • max_doc_freq (Union[float, int], default = 1.0) – Tokens whose document frequency exceeds this threshold are excluded from the vocabulary. If float, the value is a proportion of documents; if int, an absolute document count. If vocab is not None, this will be ignored.
  • min_freq (int, default = 1) – The minimum frequency of tokens to be included into vocabulary. If vocab is not None, this will be ignored.
  • stop_words (Collection, str, default: None) – Collection of stop words which will be ignored when building Vocabulary. If str, it indicates a built-in stop words list. Currently, only english is supported.
batch_seq(batch_ids, max_length=None)[source]

Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.

batch_tfidf(batch_ids)[source]

Return matrix of TF-IDF features corresponding to provided batch_ids

build(id_map=None)[source]

Build the model based on provided list of ordered ids

Image Module

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.image.ImageModule(**kwargs)[source]

Image module

batch_image(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]

Return batch of images corresponding to provided batch_ids

build(id_map=None)[source]

Build the model based on provided list of ordered ids

Reader

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

cornac.data.reader.read_ui(fpath, value=1.0, sep='\t', skip_lines=0)[source]

Read implicit feedback data in user-items form. Each line starts with a user id followed by one or more item ids.

Parameters:
  • fpath (str) – Path to the data file
  • value (float, default: 1.0) – Value for the feedback
  • sep (str, default: '\t') – The delimiter string.
  • skip_lines (int, default: 0) – Number of first lines to skip
Returns:

triplets – Data in the form of list of tuples of (user, item, value).

Return type:

iterable
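
The file layout that read_ui expects can be illustrated with a small parser working on an in-memory file. read_ui_lines is a hypothetical stand-in that takes an iterable of lines rather than a path; stripping whitespace and skipping empty fields are assumptions.

```python
import io


def read_ui_lines(lines, value=1.0, sep="\t", skip_lines=0):
    """Expand 'user<sep>item1<sep>item2...' lines into (user, item, value) triplets."""
    triplets = []
    for n, line in enumerate(lines):
        if n < skip_lines:
            continue
        tokens = [tok.strip() for tok in line.split(sep)]
        user, items = tokens[0], tokens[1:]
        triplets.extend((user, item, value) for item in items if item)
    return triplets


sample = io.StringIO("user\titems\nu1\ti1\ti2\nu2\ti3\n")
triplets = read_ui_lines(sample, skip_lines=1)  # skip the header line
```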

cornac.data.reader.read_uir(fpath, u_col=0, i_col=1, r_col=2, sep='\t', skip_lines=0)[source]

Read data in the form of triplets (user, item, rating).

Parameters:
  • fpath (str) – Path to the data file
  • u_col (int, default: 0) – Index of the user column
  • i_col (int, default: 1) – Index of the item column
  • r_col (int, default: 2) – Index of the rating column
  • sep (str, default: '\t') – The delimiter string.
  • skip_lines (int, default: 0) – Number of first lines to skip
Returns:

triplets – Data in the form of list of tuples of (user, item, rating).

Return type:

iterable
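
The column-index parameters of read_uir can be illustrated in the same way. read_uir_lines below is a hypothetical helper operating on an iterable of lines; converting the rating to float and skipping blank lines are assumptions of this sketch.

```python
import io


def read_uir_lines(lines, u_col=0, i_col=1, r_col=2, sep="\t", skip_lines=0):
    """Parse delimited lines into (user, item, rating) triplets."""
    triplets = []
    for n, line in enumerate(lines):
        if n < skip_lines or not line.strip():
            continue
        tokens = [tok.strip() for tok in line.split(sep)]
        triplets.append((tokens[u_col], tokens[i_col], float(tokens[r_col])))
    return triplets


# Comma-separated sample with an extra trailing column that is ignored.
sample = io.StringIO("u1,i1,4.0,2019\nu2,i1,3.5,2019\n")
triplets = read_uir_lines(sample, sep=",")
```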