Data¶
-
class
cornac.data.
FeatureModule
(features=None, ids=None, normalized=False, **kwargs)[source]¶ Parameters: - features (numpy.ndarray or scipy.sparse.csr_matrix, default = None) – 2D array or sparse matrix whose row indices are aligned with the user/item ids in ids.
- ids (List, default = None) – List of user/item ids whose indices are aligned with the rows of features. If None, the row indices of the provided features will be used as ids.
-
batch_feature
(batch_ids)[source]¶ Return a matrix (batch of feature vectors) corresponding to provided batch_ids
-
build
(id_map=None)[source]¶ Build the feature matrix. If id_map is provided, the feature rows are swapped to follow its ordering.
-
feature_dim
¶ Return the dimensionality of the feature vectors
-
features
¶ Return the whole feature matrix
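As a rough illustration of the id alignment described above, the sketch below reorders feature rows to follow an id map and then looks up a batch of rows. It uses plain Python lists instead of numpy or scipy, and the helper names align_features and lookup_batch, together with the choice to skip ids missing from the map, are assumptions made for this example rather than cornac's implementation.

```python
from typing import Dict, Hashable, List, Sequence


def align_features(features: Sequence[Sequence[float]],
                   ids: Sequence[Hashable],
                   id_map: Dict[Hashable, int]) -> List[List[float]]:
    """Reorder rows so that row id_map[raw_id] holds the vector of raw_id."""
    aligned: List[List[float]] = [[] for _ in range(len(id_map))]
    for row, raw_id in zip(features, ids):
        if raw_id in id_map:  # assumption: ids unknown to the map are skipped
            aligned[id_map[raw_id]] = list(row)
    return aligned


def lookup_batch(aligned: List[List[float]],
                 batch_ids: Sequence[int]) -> List[List[float]]:
    """Return the feature vectors for the given mapped ids, in order."""
    return [aligned[i] for i in batch_ids]


features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ids = ["a", "b", "c"]                # raw ids, aligned with the rows above
id_map = {"c": 0, "a": 1, "b": 2}    # raw id -> mapped index

aligned = align_features(features, ids, id_map)
batch = lookup_batch(aligned, [2, 0])
```

After alignment, a mapped index can be used directly as a row index, which is what makes batch lookups cheap.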
-
class
cornac.data.
TextModule
(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]¶ Text module
Parameters: - corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
- ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
- tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping between tokens and their integer ids, in both directions.
- max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
- max_doc_freq (Union[float, int] = 1.0) – The maximum document frequency; tokens appearing in more documents than this are excluded from the vocabulary. A float represents a proportion of documents, an int an absolute count. If vocab is not None, this will be ignored.
- min_freq (int, default = 1) – The minimum frequency a token needs to be included in the vocabulary. If vocab is not None, this will be ignored.
- stop_words (Collection, str, default: None) – Collection of stop words which will be ignored when building the Vocabulary. If str, it names a built-in stop words list. Currently, only english is supported.
-
batch_seq
(batch_ids, max_length=None)[source]¶ Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.
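The padding behaviour described for batch_seq can be pictured with the following minimal sketch. It works on plain lists of token ids; the function name pad_batch, the padding id 0, and the truncation of sequences longer than max_length are assumptions for illustration and are not confirmed by the documentation above.

```python
from typing import List, Optional, Sequence


def pad_batch(sequences: Sequence[List[int]],
              batch_ids: Sequence[int],
              max_length: Optional[int] = None,
              pad_id: int = 0) -> List[List[int]]:
    """Collect the sequences for batch_ids and pad them to a common length."""
    batch = [sequences[i] for i in batch_ids]
    if max_length is None:
        # Infer the width from the longest retrieved sequence.
        max_length = max((len(seq) for seq in batch), default=0)
    # Assumption: longer sequences are truncated, shorter ones right-padded.
    return [seq[:max_length] + [pad_id] * (max_length - len(seq))
            for seq in batch]


sequences = [[5, 3], [7], [2, 4, 6, 8]]
inferred = pad_batch(sequences, [0, 1])
fixed = pad_batch(sequences, [2, 1], max_length=3)
```

The result always has len(batch_ids) rows and max_length columns, matching the size stated above.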
-
class
cornac.data.
ImageModule
(**kwargs)[source]¶ Image module
-
class
cornac.data.
GraphModule
(**kwargs)[source]¶ Graph module
-
batch
(batch_ids)[source]¶ Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.
Parameters: batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.
-
class
cornac.data.
TrainSet
(uid_map, iid_map)[source]¶ Training Set
Parameters: - uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
static
idx_iter
(idx_range, batch_size=1, shuffle=False)[source]¶ Create an iterator over batch of indices
Returns: iterator – batch of indices (array of np.int)
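A batch iterator over indices of this kind can be sketched as follows. This is a plain-Python stand-in that yields lists rather than numpy arrays; the name batch_indices, the interpretation of idx_range as a count of indices, and the seed parameter are assumptions for the example.

```python
import random
from typing import Iterator, List, Optional


def batch_indices(idx_range: int,
                  batch_size: int = 1,
                  shuffle: bool = False,
                  seed: Optional[int] = None) -> Iterator[List[int]]:
    """Yield consecutive batches of indices from range(idx_range)."""
    indices = list(range(idx_range))
    if shuffle:
        # A local generator keeps the shuffle reproducible for a given seed.
        random.Random(seed).shuffle(indices)
    for start in range(0, idx_range, batch_size):
        # The last batch may be smaller than batch_size.
        yield indices[start:start + batch_size]


ordered = list(batch_indices(5, batch_size=2))
shuffled = list(batch_indices(5, batch_size=2, shuffle=True, seed=0))
```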
-
iid_list
¶ Return the list of mapped item ids
-
num_items
¶ Return the number of items
-
num_users
¶ Return the number of users
-
raw_iid_list
¶ Return the list of raw item ids
-
raw_uid_list
¶ Return the list of raw user ids
-
uid_list
¶ Return the list of mapped user ids
-
class
cornac.data.
MatrixTrainSet
(matrix, max_rating, min_rating, global_mean, uid_map, iid_map)[source]¶ Training set containing a preference matrix
Parameters: - matrix (scipy.sparse.csr_matrix) – Preferences in the form of a scipy sparse matrix.
- max_rating (float) – Maximum value of the preferences.
- min_rating (float) – Minimum value of the preferences.
- global_mean (float) – Average value of the preferences.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
classmethod
from_uir
(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]¶ Construct a MatrixTrainSet from triplet data.
Parameters: - data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
- global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
- global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
- global_ui_set (set, optional, default: None) – The global set of tuples (user, item). This helps avoid duplicate observations.
- verbose (bool, default: False) – The verbosity flag.
Returns: train_set – MatrixTrainSet object.
Return type: <cornac.data.MatrixTrainSet>
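To make the roles of the id maps and the (user, item) set concrete, here is a minimal sketch of turning raw triplets into indexed triplets while dropping duplicate observations. The helper names index_triplets and rating_stats are hypothetical, plain dicts stand in for the defaultdict maps, and no sparse matrix is built; it is not the library's code.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple


def index_triplets(data: Iterable[Tuple[Hashable, Hashable, float]],
                   uid_map: Optional[Dict[Hashable, int]] = None,
                   iid_map: Optional[Dict[Hashable, int]] = None,
                   ui_set: Optional[Set[Tuple[Hashable, Hashable]]] = None):
    """Map raw (user, item, rating) triplets to integer-indexed triplets."""
    uid_map = {} if uid_map is None else uid_map
    iid_map = {} if iid_map is None else iid_map
    ui_set = set() if ui_set is None else ui_set
    rows: List[Tuple[int, int, float]] = []
    for user, item, rating in data:
        if (user, item) in ui_set:
            continue  # duplicate observation, keep only the first one
        ui_set.add((user, item))
        # New raw ids receive the next free integer index.
        u = uid_map.setdefault(user, len(uid_map))
        i = iid_map.setdefault(item, len(iid_map))
        rows.append((u, i, float(rating)))
    return rows, uid_map, iid_map


def rating_stats(rows: List[Tuple[int, int, float]]):
    """Return (max_rating, min_rating, global_mean) of the indexed triplets."""
    ratings = [r for _, _, r in rows]
    return max(ratings), min(ratings), sum(ratings) / len(ratings)


data = [("u1", "i1", 4), ("u2", "i1", 2), ("u1", "i1", 5), ("u1", "i2", 3)]
rows, uid_map, iid_map = index_triplets(data)
stats = rating_stats(rows)
```

Passing the same maps and set to a later call keeps user and item indices consistent across several splits, which is the purpose of the global_ arguments.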
-
item_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over item ids
Returns: iterator – batch of item ids (array of np.int)
-
uij_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over data yielding batch of users, positive items, and negative items
Returns: iterator – batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)
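One common way to produce such (user, positive item, negative item) batches is to draw, for each observed pair, a random item the user has not interacted with. The sketch below shows that idea with uniform rejection sampling; the sampling strategy, the function name uij_batches, and the seed parameter are assumptions, since the documentation above does not state how negatives are chosen. It also assumes every user has at least one unobserved item, otherwise the inner loop would not terminate.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def uij_batches(positives: Sequence[Tuple[int, int]],
                num_items: int,
                batch_size: int = 1,
                seed: Optional[int] = None
                ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (users, positive items, negative items) batches."""
    rng = random.Random(seed)
    observed = set(positives)
    for start in range(0, len(positives), batch_size):
        chunk = positives[start:start + batch_size]
        users = [u for u, _ in chunk]
        pos_items = [i for _, i in chunk]
        neg_items = []
        for u in users:
            j = rng.randrange(num_items)
            while (u, j) in observed:  # reject items the user already has
                j = rng.randrange(num_items)
            neg_items.append(j)
        yield users, pos_items, neg_items


positives = [(0, 0), (0, 1), (1, 2)]
batches = list(uij_batches(positives, num_items=4, batch_size=2, seed=0))
```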
-
uir_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over data yielding batch of users, items, and rating values
Returns: iterator – batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)
-
class
cornac.data.
MultimodalTrainSet
(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]¶ Multimodal training set
Parameters: - matrix (scipy.sparse.csr_matrix) – Preferences in the form of a scipy sparse matrix.
- max_rating (float) – Maximum value of the preferences.
- min_rating (float) – Minimum value of the preferences.
- global_mean (float) – Average value of the preferences.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
class
cornac.data.
TestSet
(user_ratings, uid_map, iid_map)[source]¶ Test Set
Parameters: - user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
classmethod
from_uir
(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]¶ Construct a TestSet from triplet data.
Parameters: - data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
- global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
- global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
- global_ui_set (set) – The global set of tuples (user, item). This helps avoid duplicate observations.
- verbose (bool, default: False) – The verbosity flag.
Returns: test_set – TestSet object.
Return type: <cornac.data.TestSet>
-
users
¶ Return a list of users
-
class
cornac.data.
MultimodalTestSet
(user_ratings, uid_map, iid_map, **kwargs)[source]¶ Test Set
Parameters: - user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
Train Set¶
@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>
-
class
cornac.data.trainset.
MatrixTrainSet
(matrix, max_rating, min_rating, global_mean, uid_map, iid_map)[source]¶ Training set containing a preference matrix
Parameters: - matrix (scipy.sparse.csr_matrix) – Preferences in the form of a scipy sparse matrix.
- max_rating (float) – Maximum value of the preferences.
- min_rating (float) – Minimum value of the preferences.
- global_mean (float) – Average value of the preferences.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
classmethod
from_uir
(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]¶ Construct a MatrixTrainSet from triplet data.
Parameters: - data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
- global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
- global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
- global_ui_set (set, optional, default: None) – The global set of tuples (user, item). This helps avoid duplicate observations.
- verbose (bool, default: False) – The verbosity flag.
Returns: train_set – MatrixTrainSet object.
Return type: <cornac.data.MatrixTrainSet>
-
item_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over item ids
Returns: iterator – batch of item ids (array of np.int)
-
uij_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over data yielding batch of users, positive items, and negative items
Returns: iterator – batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)
-
uir_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over data yielding batch of users, items, and rating values
Returns: iterator – batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)
-
class
cornac.data.trainset.
MultimodalTrainSet
(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]¶ Multimodal training set
Parameters: - matrix (scipy.sparse.csr_matrix) – Preferences in the form of a scipy sparse matrix.
- max_rating (float) – Maximum value of the preferences.
- min_rating (float) – Minimum value of the preferences.
- global_mean (float) – Average value of the preferences.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
class
cornac.data.trainset.
TrainSet
(uid_map, iid_map)[source]¶ Training Set
Parameters: - uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
static
idx_iter
(idx_range, batch_size=1, shuffle=False)[source]¶ Create an iterator over batch of indices
Returns: iterator – batch of indices (array of np.int)
-
iid_list
¶ Return the list of mapped item ids
-
num_items
¶ Return the number of items
-
num_users
¶ Return the number of users
-
raw_iid_list
¶ Return the list of raw item ids
-
raw_uid_list
¶ Return the list of raw user ids
-
uid_list
¶ Return the list of mapped user ids
Test Set¶
@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>
-
class
cornac.data.testset.
MultimodalTestSet
(user_ratings, uid_map, iid_map, **kwargs)[source]¶ Test Set
Parameters: - user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
class
cornac.data.testset.
TestSet
(user_ratings, uid_map, iid_map)[source]¶ Test Set
Parameters: - user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
- uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
- iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
-
classmethod
from_uir
(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]¶ Construct a TestSet from triplet data.
Parameters: - data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
- global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
- global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
- global_ui_set (set) – The global set of tuples (user, item). This helps avoid duplicate observations.
- verbose (bool, default: False) – The verbosity flag.
Returns: test_set – TestSet object.
Return type: <cornac.data.TestSet>
-
users
¶ Return a list of users
Graph Module¶
@author: Aghiles Salah <asalah@smu.edu.sg>
-
class
cornac.data.graph.
GraphModule
(**kwargs)[source]¶ Graph module
-
batch
(batch_ids)[source]¶ Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.
Parameters: batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.
-
Text Module¶
@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>
-
class
cornac.data.text.
Tokenizer
[source]¶ Generic class for other subclasses to extend from. This typically either splits text into word tokens or character tokens.
-
class
cornac.data.text.
BaseTokenizer
(sep: str = ' ', pre_rules: List[Callable[str, str]] = None, stop_words: Union[List, str] = None)[source]¶ A base tokenizer that uses the provided delimiter sep to split text.
-
class
cornac.data.text.
Vocabulary
(idx2tok: List[str], use_special_tokens: bool = False)[source]¶ Vocabulary holds the mapping between tokens and their integer ids, in both directions.
-
classmethod
from_sequences
(sequences: List[List[str]], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]¶ Build a vocabulary from sequences (list of list of tokens).
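The effect of max_vocab and min_freq when building a vocabulary from token sequences can be sketched as below. This is a simplified stand-in, not the Vocabulary class itself: the function name build_vocab is hypothetical, ties are broken alphabetically as an assumption, and special tokens are ignored.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def build_vocab(sequences: List[List[str]],
                max_vocab: Optional[int] = None,
                min_freq: int = 1) -> Tuple[List[str], Dict[str, int]]:
    """Return (idx2tok, tok2idx) built from lists of tokens."""
    counts = Counter(tok for seq in sequences for tok in seq)
    # Most frequent tokens first; alphabetical order breaks ties (assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    idx2tok = [tok for tok, freq in ranked if freq >= min_freq]
    if max_vocab is not None:
        idx2tok = idx2tok[:max_vocab]
    tok2idx = {tok: idx for idx, tok in enumerate(idx2tok)}
    return idx2tok, tok2idx


sequences = [["a", "b", "a"], ["b", "c", "a"]]
idx2tok, tok2idx = build_vocab(sequences, min_freq=2)
top1, _ = build_vocab(sequences, max_vocab=1)
```

Keeping both idx2tok and tok2idx gives constant-time lookups in either direction.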
-
class
cornac.data.text.
CountVectorizer
(tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, max_features: int = None, stop_words: Union[List, str] = None, binary: bool = False)[source]¶ Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
Parameters: - tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping between tokens and their integer ids, in both directions.
- max_doc_freq (Union[float, int] = 1.0) – The maximum document frequency; tokens appearing in more documents than this are excluded from the vocabulary. A float represents a proportion of documents, an int an absolute count. If vocab is not None, this will be ignored.
- min_freq (int, default = 1) – The minimum frequency a token needs to be included in the vocabulary. If vocab is not None, this will be ignored.
- max_features (int, default=None) – If not None, build a vocabulary that only considers the top max_features tokens ordered by term frequency across the corpus. If vocab is not None, this will be ignored.
- stop_words (Collection, str, default: None) – Collection of stop words which will be ignored when building the Vocabulary. If str, it names a built-in stop words list. Currently, only english is supported.
- binary (boolean, default=False) – If True, all non-zero counts are set to 1.
-
fit
(raw_documents: List[str]) → cornac.data.text.CountVectorizer[source]¶ Build a vocabulary of all tokens in the raw documents.
Parameters: raw_documents (iterable) – An iterable which yields either str, unicode or file objects.
Returns: self
-
fit_transform
(raw_documents: List[str]) -> (typing.List[typing.List[str]], <class 'scipy.sparse.csr.csr_matrix'>)[source]¶ Build the vocabulary and return the document-term matrix.
Parameters: raw_documents (List[str])
Returns: - sequences (List[List[str]]) – Tokenized sequences of raw_documents.
- X (array, [n_samples, n_features]) – Document-term matrix.
Return type: (sequences, X)
-
transform
(raw_documents: List[str]) -> (typing.List[typing.List[str]], <class 'scipy.sparse.csr.csr_matrix'>)[source]¶ Transform documents to a document-term matrix.
Parameters: raw_documents (List[str])
Returns: - sequences (List[List[str]]) – Tokenized sequences of raw_documents.
- X (array, [n_samples, n_features]) – Document-term matrix.
Return type: (sequences, X)
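The interaction of max_doc_freq, min_freq and binary can be illustrated with a small pure-Python count vectorizer. This is a sketch under stated assumptions, not the class documented above: the name count_vectorize is hypothetical, the output is a dense list of lists instead of a csr_matrix, the vocabulary is ordered alphabetically, and min_freq is interpreted as total term frequency across the corpus.

```python
from collections import Counter
from typing import List, Tuple, Union


def count_vectorize(raw_documents: List[str],
                    sep: str = " ",
                    max_doc_freq: Union[float, int] = 1.0,
                    min_freq: int = 1,
                    binary: bool = False
                    ) -> Tuple[List[List[str]], List[str], List[List[int]]]:
    """Return (sequences, vocabulary, dense document-term matrix)."""
    sequences = [doc.split(sep) for doc in raw_documents]
    term_freq = Counter(tok for seq in sequences for tok in seq)
    doc_freq = Counter(tok for seq in sequences for tok in set(seq))
    # A float is a proportion of documents, an int an absolute count.
    if isinstance(max_doc_freq, float):
        max_count = max_doc_freq * len(sequences)
    else:
        max_count = max_doc_freq
    vocab = sorted(tok for tok in term_freq
                   if doc_freq[tok] <= max_count and term_freq[tok] >= min_freq)
    index = {tok: col for col, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for tok in seq:
            if tok in index:
                col = index[tok]
                row[col] = 1 if binary else row[col] + 1
    return sequences, vocab, matrix


docs = ["red apple red", "green apple", "red pear"]
sequences, vocab, X = count_vectorize(docs)
_, rare_vocab, rare_X = count_vectorize(docs, max_doc_freq=0.5)
```

With max_doc_freq=0.5 and three documents, any token found in two or more documents is dropped, which leaves only the tokens that occur in a single document.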
-
class
cornac.data.text.
TextModule
(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]¶ Text module
Parameters: - corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
- ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
- tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping between tokens and their integer ids, in both directions.
- max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
- max_doc_freq (Union[float, int] = 1.0) – The maximum document frequency; tokens appearing in more documents than this are excluded from the vocabulary. A float represents a proportion of documents, an int an absolute count. If vocab is not None, this will be ignored.
- min_freq (int, default = 1) – The minimum frequency a token needs to be included in the vocabulary. If vocab is not None, this will be ignored.
- stop_words (Collection, str, default: None) – Collection of stop words which will be ignored when building the Vocabulary. If str, it names a built-in stop words list. Currently, only english is supported.
-
batch_seq
(batch_ids, max_length=None)[source]¶ Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.
Image Module¶
@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>
Reader¶
@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>
-
cornac.data.reader.
read_ui
(fpath, value=1.0, sep='\t', skip_lines=0)[source]¶ Read data in the form of implicit feedback user-items. Each line starts with a user id followed by one or more item ids.
Returns: triplets – Data in the form of a list of tuples (user, item, 1).
Return type: iterable
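The line format described for read_ui can be parsed along the following lines. The sketch reads from any iterable of lines instead of a file path so that it stays self-contained; the function name parse_ui, the skipping of blank lines, and the use of value as the third tuple element are assumptions, not a copy of the library's reader.

```python
import io
from typing import Iterable, List, Tuple


def parse_ui(lines: Iterable[str],
             value: float = 1.0,
             sep: str = "\t",
             skip_lines: int = 0) -> List[Tuple[str, str, float]]:
    """Turn 'user<sep>item<sep>item...' lines into (user, item, value) triplets."""
    triplets: List[Tuple[str, str, float]] = []
    for line_no, line in enumerate(lines):
        if line_no < skip_lines:
            continue  # e.g. a header line
        tokens = [tok.strip() for tok in line.split(sep)]
        tokens = [tok for tok in tokens if tok]
        if not tokens:
            continue  # assumption: blank lines are ignored
        user, items = tokens[0], tokens[1:]
        triplets.extend((user, item, value) for item in items)
    return triplets


sample = "user\titems\nu1\ti1\ti2\nu2\ti3\n"
triplets = parse_ui(io.StringIO(sample), skip_lines=1)
```

Each input line expands into one triplet per listed item, so a user with many items yields many triplets sharing the same user id.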