Help on class ALS in module pyspark.ml.recommendation:
class ALS(pyspark.ml.wrapper.JavaEstimator, _ALSParams, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable)
| Alternating Least Squares (ALS) matrix factorization.
|
| ALS attempts to estimate the ratings matrix `R` as the product of
| two lower-rank matrices, `X` and `Y`, i.e. `X * Yt = R`. Typically
| these approximations are called 'factor' matrices. The general
| approach is iterative. During each iteration, one of the factor
| matrices is held constant, while the other is solved for using least
| squares. The newly-solved factor matrix is then held constant while
| solving for the other factor matrix.
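|
| As a rough illustration of the alternating updates only (a dense NumPy
| sketch that treats every cell of `R` as observed, unlike the sparse,
| blocked implementation described below; all values are toy data)::
|
|     import numpy as np
|
|     R = np.array([[4.0, 2.0, 0.0],        # toy ratings matrix (users x items)
|                   [0.0, 3.0, 4.0],
|                   [0.0, 1.0, 5.0]])
|     rank, reg = 2, 0.1
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(R.shape[0], rank))   # user factors
|     Y = rng.normal(size=(R.shape[1], rank))   # item factors
|     for _ in range(10):
|         # hold Y fixed, solve regularized least squares for X
|         X = np.linalg.solve(Y.T @ Y + reg * np.eye(rank), Y.T @ R.T).T
|         # hold X fixed, solve regularized least squares for Y
|         Y = np.linalg.solve(X.T @ X + reg * np.eye(rank), X.T @ R).T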
|
| This is a blocked implementation of the ALS factorization algorithm
| that groups the two sets of factors (referred to as "users" and
| "products") into blocks and reduces communication by only sending
| one copy of each user vector to each product block on each
| iteration, and only for the product blocks that need that user's
| feature vector. This is achieved by pre-computing some information
| about the ratings matrix to determine the "out-links" of each user
| (which blocks of products it will contribute to) and "in-link"
| information for each product (which of the feature vectors it
| receives from each user block it will depend on). This allows us to
| send only an array of feature vectors between each user block and
| product block, and have the product block find the users' ratings
| and update the products based on these messages.
|
| For implicit preference data, the algorithm used is based on
| `"Collaborative Filtering for Implicit Feedback Datasets",
| <https://doi.org/10.1109/ICDM.2008.22>`_, adapted for the blocked
| approach used here.
|
| Essentially, instead of finding the low-rank approximations to the
| rating matrix `R`, this finds the approximations for a preference
| matrix `P` where the elements of `P` are 1 if r > 0 and 0 if r <= 0.
| The ratings then act as 'confidence' values related to the strength of
| indicated user preferences rather than explicit ratings given to
| items.
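|
| Implicit-feedback training is enabled through the :py:attr:`implicitPrefs`
| and :py:attr:`alpha` params; for example (the alpha value and the "clicks"
| rating column below are illustrative, not required values)::
|
|     als = ALS(rank=10, implicitPrefs=True, alpha=40.0,
|               userCol="user", itemCol="item", ratingCol="clicks")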
|
| .. versionadded:: 1.4.0
|
| Notes
| -----
| The input rating dataframe to the ALS implementation should be deterministic.
| Nondeterministic data can cause failures when fitting the ALS model.
| For example, an order-sensitive operation like sampling after a repartition makes
| the dataframe output nondeterministic, e.g. `df.repartition(2).sample(False, 0.5, 1618)`.
| Checkpointing the sampled dataframe or adding a sort before sampling can help make the
| dataframe deterministic.
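|
| A minimal sketch of the two workarounds mentioned above (the checkpoint
| directory and the sort keys are illustrative)::
|
|     spark.sparkContext.setCheckpointDir("/tmp/als_checkpoints")  # illustrative path
|     deterministic_df = df.repartition(2).sample(False, 0.5, 1618).checkpoint()
|     # or: sort on a unique key before sampling instead of checkpointing
|     deterministic_df = df.sort("user", "item").sample(False, 0.5, 1618)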
|
| Examples
| --------
| >>> df = spark.createDataFrame(
| ... [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
| ... ["user", "item", "rating"])
| >>> als = ALS(rank=10, seed=0)
| >>> als.setMaxIter(5)
| ALS...
| >>> als.getMaxIter()
| 5
| >>> als.setRegParam(0.1)
| ALS...
| >>> als.getRegParam()
| 0.1
| >>> als.clear(als.regParam)
| >>> model = als.fit(df)
| >>> model.getBlockSize()
| 4096
| >>> model.getUserCol()
| 'user'
| >>> model.setUserCol("user")
| ALSModel...
| >>> model.getItemCol()
| 'item'
| >>> model.setPredictionCol("newPrediction")
| ALSModel...
| >>> model.rank
| 10
| >>> model.userFactors.orderBy("id").collect()
| [Row(id=0, features=[...]), Row(id=1, ...), Row(id=2, ...)]
| >>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
| >>> predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
| >>> predictions[0]
| Row(user=0, item=2, newPrediction=0.692910...)
| >>> predictions[1]
| Row(user=1, item=0, newPrediction=3.473569...)
| >>> predictions[2]
| Row(user=2, item=0, newPrediction=-0.899198...)
| >>> user_recs = model.recommendForAllUsers(3)
| >>> user_recs.where(user_recs.user == 0).select("recommendations.item", "recommendations.rating").collect()
| [Row(item=[0, 1, 2], rating=[3.910..., 1.997..., 0.692...])]
| >>> item_recs = model.recommendForAllItems(3)
| >>> item_recs.where(item_recs.item == 2).select("recommendations.user", "recommendations.rating").collect()
| [Row(user=[2, 1, 0], rating=[4.892..., 3.991..., 0.692...])]
| >>> user_subset = df.where(df.user == 2)
| >>> user_subset_recs = model.recommendForUserSubset(user_subset, 3)
| >>> user_subset_recs.select("recommendations.item", "recommendations.rating").first()
| Row(item=[2, 1, 0], rating=[4.892..., 1.076..., -0.899...])
| >>> item_subset = df.where(df.item == 0)
| >>> item_subset_recs = model.recommendForItemSubset(item_subset, 3)
| >>> item_subset_recs.select("recommendations.user", "recommendations.rating").first()
| Row(user=[0, 1, 2], rating=[3.910..., 3.473..., -0.899...])
| >>> als_path = temp_path + "/als"
| >>> als.save(als_path)
| >>> als2 = ALS.load(als_path)
| >>> als2.getMaxIter()
| 5
| >>> model_path = temp_path + "/als_model"
| >>> model.save(model_path)
| >>> model2 = ALSModel.load(model_path)
| >>> model.rank == model2.rank
| True
| >>> sorted(model.userFactors.collect()) == sorted(model2.userFactors.collect())
| True
| >>> sorted(model.itemFactors.collect()) == sorted(model2.itemFactors.collect())
| True
| >>> model.transform(test).take(1) == model2.transform(test).take(1)
| True
|
| Method resolution order:
| ALS
| pyspark.ml.wrapper.JavaEstimator
| pyspark.ml.wrapper.JavaParams
| pyspark.ml.wrapper.JavaWrapper
| pyspark.ml.base.Estimator
| _ALSParams
| _ALSModelParams
| pyspark.ml.param.shared.HasPredictionCol
| pyspark.ml.param.shared.HasBlockSize
| pyspark.ml.param.shared.HasMaxIter
| pyspark.ml.param.shared.HasRegParam
| pyspark.ml.param.shared.HasCheckpointInterval
| pyspark.ml.param.shared.HasSeed
| pyspark.ml.param.Params
| pyspark.ml.util.Identifiable
| pyspark.ml.util.JavaMLWritable
| pyspark.ml.util.MLWritable
| pyspark.ml.util.JavaMLReadable
| pyspark.ml.util.MLReadable
| builtins.object
|
| Methods defined here:
|
| __init__(self, *, rank=10, maxIter=10, regParam=0.1, numUserBlocks=10, numItemBlocks=10, implicitPrefs=False, alpha=1.0, userCol='user', itemCol='item', seed=None, ratingCol='rating', nonnegative=False, checkpointInterval=10, intermediateStorageLevel='MEMORY_AND_DISK', finalStorageLevel='MEMORY_AND_DISK', coldStartStrategy='nan', blockSize=4096)
| __init__(self, \*, rank=10, maxIter=10, regParam=0.1, numUserBlocks=10,
|     numItemBlocks=10, implicitPrefs=False, alpha=1.0, userCol="user",
|     itemCol="item", seed=None, ratingCol="rating", nonnegative=False,
|     checkpointInterval=10, intermediateStorageLevel="MEMORY_AND_DISK",
|     finalStorageLevel="MEMORY_AND_DISK", coldStartStrategy="nan",
|     blockSize=4096)
|
| setAlpha(self, value)
| Sets the value of :py:attr:`alpha`.
|
| .. versionadded:: 1.4.0
|
| setBlockSize(self, value)
| Sets the value of :py:attr:`blockSize`.
|
| .. versionadded:: 3.0.0
|
| setCheckpointInterval(self, value)
| Sets the value of :py:attr:`checkpointInterval`.
|
| setColdStartStrategy(self, value)
| Sets the value of :py:attr:`coldStartStrategy`.
|
| .. versionadded:: 2.2.0
|
| setFinalStorageLevel(self, value)
| Sets the value of :py:attr:`finalStorageLevel`.
|
| .. versionadded:: 2.0.0
|
| setImplicitPrefs(self, value)
| Sets the value of :py:attr:`implicitPrefs`.
|
| .. versionadded:: 1.4.0
|
| setIntermediateStorageLevel(self, value)
| Sets the value of :py:attr:`intermediateStorageLevel`.
|
| .. versionadded:: 2.0.0
|
| setItemCol(self, value)
| Sets the value of :py:attr:`itemCol`.
|
| .. versionadded:: 1.4.0
|
| setMaxIter(self, value)
| Sets the value of :py:attr:`maxIter`.
|
| setNonnegative(self, value)
| Sets the value of :py:attr:`nonnegative`.
|
| .. versionadded:: 1.4.0
|
| setNumBlocks(self, value)
| Sets both :py:attr:`numUserBlocks` and :py:attr:`numItemBlocks` to the specified value.
|
| .. versionadded:: 1.4.0
|
| setNumItemBlocks(self, value)
| Sets the value of :py:attr:`numItemBlocks`.
|
| .. versionadded:: 1.4.0
|
| setNumUserBlocks(self, value)
| Sets the value of :py:attr:`numUserBlocks`.
|
| .. versionadded:: 1.4.0
|
| setParams(self, *, rank=10, maxIter=10, regParam=0.1, numUserBlocks=10, numItemBlocks=10, implicitPrefs=False, alpha=1.0, userCol='user', itemCol='item', seed=None, ratingCol='rating', nonnegative=False, checkpointInterval=10, intermediateStorageLevel='MEMORY_AND_DISK', finalStorageLevel='MEMORY_AND_DISK', coldStartStrategy='nan', blockSize=4096)
| setParams(self, \*, rank=10, maxIter=10, regParam=0.1, numUserBlocks=10,
|     numItemBlocks=10, implicitPrefs=False, alpha=1.0, userCol="user",
|     itemCol="item", seed=None, ratingCol="rating", nonnegative=False,
|     checkpointInterval=10, intermediateStorageLevel="MEMORY_AND_DISK",
|     finalStorageLevel="MEMORY_AND_DISK", coldStartStrategy="nan",
|     blockSize=4096)
| Sets params for ALS.
|
| .. versionadded:: 1.4.0
|
| setPredictionCol(self, value)
| Sets the value of :py:attr:`predictionCol`.
|
| setRank(self, value)
| Sets the value of :py:attr:`rank`.
|
| .. versionadded:: 1.4.0
|
| setRatingCol(self, value)
| Sets the value of :py:attr:`ratingCol`.
|
| .. versionadded:: 1.4.0
|
| setRegParam(self, value)
| Sets the value of :py:attr:`regParam`.
|
| setSeed(self, value)
| Sets the value of :py:attr:`seed`.
|
| setUserCol(self, value)
| Sets the value of :py:attr:`userCol`.
|
| .. versionadded:: 1.4.0
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.wrapper.JavaParams:
|
| clear(self, param)
| Clears a param from the param map if it has been explicitly set.
|
| copy(self, extra=None)
| Creates a copy of this instance with the same uid and some
| extra params. This implementation first calls Params.copy and
| then makes a copy of the companion Java pipeline component with
| extra params. So both the Python wrapper and the Java pipeline
| component get copied.
|
| Parameters
| ----------
| extra : dict, optional
| Extra parameters to copy to the new instance
|
| Returns
| -------
| :py:class:`JavaParams`
| Copy of this instance
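|
| For example (illustrative, using an `als` estimator as in the class-level
| Examples section)::
|
|     als_copy = als.copy({als.maxIter: 20})   # copies the Python wrapper and Java component
|     als_copy.getMaxIter()                    # 20; the original `als` is unchanged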
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.wrapper.JavaWrapper:
|
| __del__(self)
|
| ----------------------------------------------------------------------
| Data descriptors inherited from pyspark.ml.wrapper.JavaWrapper:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.base.Estimator:
|
| fit(self, dataset, params=None)
| Fits a model to the input dataset with optional parameters.
|
| .. versionadded:: 1.3.0
|
| Parameters
| ----------
| dataset : :py:class:`pyspark.sql.DataFrame`
| input dataset.
| params : dict or list or tuple, optional
| an optional param map that overrides embedded params. If a list/tuple of
| param maps is given, this calls fit on each param map and returns a list of
| models.
|
| Returns
| -------
| :py:class:`Transformer` or a list of :py:class:`Transformer`
| fitted model(s)
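|
| For example (illustrative, reusing the `als` estimator and `df` dataframe
| from the class-level Examples section)::
|
|     model = als.fit(df, params={als.regParam: 0.05})
|     # a list of param maps returns one fitted model per map
|     models = als.fit(df, params=[{als.rank: 5}, {als.rank: 20}])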
|
| fitMultiple(self, dataset, paramMaps)
| Fits a model to the input dataset for each param map in `paramMaps`.
|
| .. versionadded:: 2.3.0
|
| Parameters
| ----------
| dataset : :py:class:`pyspark.sql.DataFrame`
| input dataset.
| paramMaps : :py:class:`collections.abc.Sequence`
| A Sequence of param maps.
|
| Returns
| -------
| :py:class:`_FitMultipleIterator`
| A thread-safe iterable which contains one model for each param map. Each
| call to `next(modelIterator)` will return `(index, model)` where model was fit
| using `paramMaps[index]`. `index` values may not be sequential.
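|
| For example (illustrative; `als` and `df` as in the class-level Examples
| section)::
|
|     param_maps = [{als.rank: 5}, {als.rank: 20}]
|     models = [None] * len(param_maps)
|     for index, model in als.fitMultiple(df, param_maps):
|         models[index] = model   # index refers back into param_maps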
|
| ----------------------------------------------------------------------
| Methods inherited from _ALSParams:
|
| getAlpha(self)
| Gets the value of alpha or its default value.
|
| .. versionadded:: 1.4.0
|
| getFinalStorageLevel(self)
| Gets the value of finalStorageLevel or its default value.
|
| .. versionadded:: 2.0.0
|
| getImplicitPrefs(self)
| Gets the value of implicitPrefs or its default value.
|
| .. versionadded:: 1.4.0
|
| getIntermediateStorageLevel(self)
| Gets the value of intermediateStorageLevel or its default value.
|
| .. versionadded:: 2.0.0
|
| getNonnegative(self)
| Gets the value of nonnegative or its default value.
|
| .. versionadded:: 1.4.0
|
| getNumItemBlocks(self)
| Gets the value of numItemBlocks or its default value.
|
| .. versionadded:: 1.4.0
|
| getNumUserBlocks(self)
| Gets the value of numUserBlocks or its default value.
|
| .. versionadded:: 1.4.0
|
| getRank(self)
| Gets the value of rank or its default value.
|
| .. versionadded:: 1.4.0
|
| getRatingCol(self)
| Gets the value of ratingCol or its default value.
|
| .. versionadded:: 1.4.0
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from _ALSParams:
|
| alpha = Param(parent='undefined', name='alpha', doc='alpha for implici...
|
| finalStorageLevel = Param(parent='undefined', name='finalStorageLevel'...
|
| implicitPrefs = Param(parent='undefined', name='implicitPrefs', doc='w...
|
| intermediateStorageLevel = Param(parent='undefined', name='intermediat...
|
| nonnegative = Param(parent='undefined', name='nonnegative', do...to us...
|
| numItemBlocks = Param(parent='undefined', name='numItemBlocks', doc='n...
|
| numUserBlocks = Param(parent='undefined', name='numUserBlocks', doc='n...
|
| rank = Param(parent='undefined', name='rank', doc='rank of the factori...
|
| ratingCol = Param(parent='undefined', name='ratingCol', doc='column na...
|
| ----------------------------------------------------------------------
| Methods inherited from _ALSModelParams:
|
| getColdStartStrategy(self)
| Gets the value of coldStartStrategy or its default value.
|
| .. versionadded:: 2.2.0
|
| getItemCol(self)
| Gets the value of itemCol or its default value.
|
| .. versionadded:: 1.4.0
|
| getUserCol(self)
| Gets the value of userCol or its default value.
|
| .. versionadded:: 1.4.0
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from _ALSModelParams:
|
| coldStartStrategy = Param(parent='undefined', name='coldStartStrateg.....
|
| itemCol = Param(parent='undefined', name='itemCol', doc='c...ds. Ids m...
|
| userCol = Param(parent='undefined', name='userCol', doc='c...ds. Ids m...
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.shared.HasPredictionCol:
|
| getPredictionCol(self)
| Gets the value of predictionCol or its default value.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pyspark.ml.param.shared.HasPredictionCol:
|
| predictionCol = Param(parent='undefined', name='predictionCol', doc='p...
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.shared.HasBlockSize:
|
| getBlockSize(self)
| Gets the value of blockSize or its default value.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pyspark.ml.param.shared.HasBlockSize:
|
| blockSize = Param(parent='undefined', name='blockSize', doc=...n then ...
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.shared.HasMaxIter:
|
| getMaxIter(self)
| Gets the value of maxIter or its default value.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pyspark.ml.param.shared.HasMaxIter:
|
| maxIter = Param(parent='undefined', name='maxIter', doc='max number of...
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.shared.HasRegParam:
|
| getRegParam(self)
| Gets the value of regParam or its default value.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pyspark.ml.param.shared.HasRegParam:
|
| regParam = Param(parent='undefined', name='regParam', doc='regularizat...
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.shared.HasCheckpointInterval:
|
| getCheckpointInterval(self)
| Gets the value of checkpointInterval or its default value.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pyspark.ml.param.shared.HasCheckpointInterval:
|
| checkpointInterval = Param(parent='undefined', name='checkpointInterv....
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.shared.HasSeed:
|
| getSeed(self)
| Gets the value of seed or its default value.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pyspark.ml.param.shared.HasSeed:
|
| seed = Param(parent='undefined', name='seed', doc='random seed.')
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.param.Params:
|
| explainParam(self, param)
| Explains a single param and returns its name, doc, and optional
| default value and user-supplied value in a string.
|
| explainParams(self)
| Returns the documentation of all params with their optional
| default values and user-supplied values.
|
| extractParamMap(self, extra=None)
| Extracts the embedded default param values and user-supplied
| values, and then merges them with extra values from input into
| a flat param map, where the latter value is used if there exist
| conflicts, i.e., with ordering: default param values <
| user-supplied values < extra.
|
| Parameters
| ----------
| extra : dict, optional
| extra param values
|
| Returns
| -------
| dict
| merged param map
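|
| For example (illustrative values)::
|
|     als = ALS(maxIter=3)                               # user-supplied value
|     param_map = als.extractParamMap({als.maxIter: 7})  # extra wins conflicts
|     param_map[als.maxIter]    # 7
|     param_map[als.regParam]   # 0.1 (the default)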
|
| getOrDefault(self, param)
| Gets the value of a param in the user-supplied param map or its
| default value. Raises an error if neither is set.
|
| getParam(self, paramName)
| Gets a param by its name.
|
| hasDefault(self, param)
| Checks whether a param has a default value.
|
| hasParam(self, paramName)
| Tests whether this instance contains a param with a given
| (string) name.
|
| isDefined(self, param)
| Checks whether a param is explicitly set by the user or has
| a default value.
|
| isSet(self, param)
| Checks whether a param is explicitly set by the user.
|
| set(self, param, value)
| Sets a parameter in the embedded param map.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from pyspark.ml.param.Params:
|
| params
| Returns all params ordered by name. The default implementation
| uses :py:func:`dir` to get all attributes of type
| :py:class:`Param`.
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.util.Identifiable:
|
| __repr__(self)
| Return repr(self).
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.util.JavaMLWritable:
|
| write(self)
| Returns an MLWriter instance for this ML instance.
|
| ----------------------------------------------------------------------
| Methods inherited from pyspark.ml.util.MLWritable:
|
| save(self, path)
| Save this ML instance to the given path, a shortcut of `write().save(path)`.
|
| ----------------------------------------------------------------------
| Class methods inherited from pyspark.ml.util.JavaMLReadable:
|
| read() from abc.ABCMeta
| Returns an MLReader instance for this class.
|
| ----------------------------------------------------------------------
| Class methods inherited from pyspark.ml.util.MLReadable:
|
| load(path) from abc.ABCMeta
| Reads an ML instance from the input path, a shortcut of `read().load(path)`.