My Web Markups - 謝幸娟
 technology and computing
  • This transformation removes the systematic variability among rows and columns. The remaining value $K_{ij}$ captures the interaction between row $i$ and column $j$ that cannot be explained by systematic variability among rows, among columns, or within the entire matrix.
  • An interesting consequence of this method is that adding a positive constant to $K$ makes it bistochastic.
  • If log normalization was used, all the singular vectors are meaningful, because the contribution of the overall matrix has already been discarded. If independent normalization or bistochastization were used, however, the first singular vectors $u_1$ and $v_1$ are discarded. From now on, the “first” singular vectors refers to $U = [u_2 \dots u_{p+1}]$ and $V = [v_2 \dots v_{p+1}]$ except in the case of log normalization.
  • After the matrix has been normalized according to one of these methods, its first $p$ singular vectors are computed, as we have seen earlier. Now the problem is to somehow convert these singular vectors into partitioning vectors.
  • Here, then, is the complete algorithm. First, normalize according to one of the three methods: Independent row and column normalization: $A_n = R^{-\frac{1}{2}} A C^{-\frac{1}{2}}$. Bistochastization: repeated row and column normalization until convergence. Log normalization: $K_{ij} = L_{ij} - \overline{L_{i \cdot}} - \overline{L_{\cdot j}} + \overline{L_{\cdot \cdot}}$
  • Then compute the first few singular vectors, $U$ and $V$. Determine the best subset, $U_b$ and $V_b$. Project the rows to $A V_b$ and cluster using k-means to obtain row partitions. Project the columns to $A^{T} U_b$ and cluster using k-means to obtain the column partitions.
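To make those steps concrete, here is a rough Python sketch of the pipeline, assuming independent row/column normalization and skipping the search for the "best" subset of singular vectors (it simply keeps the first few after discarding $u_1$, $v_1$). Function names and parameters are illustrative only; scikit-learn ships a tested implementation as `sklearn.cluster.SpectralBiclustering`.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_biclustering_sketch(A, n_row_clusters=3, n_col_clusters=3, n_vectors=2):
    # Independent row and column normalization: A_n = R^{-1/2} A C^{-1/2}
    R_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1, keepdims=True))
    C_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=0, keepdims=True))
    An = R_inv_sqrt * A * C_inv_sqrt

    # Singular vectors of A_n; u_1 and v_1 are discarded for this normalization
    U, _, Vt = np.linalg.svd(An)
    Ub = U[:, 1:1 + n_vectors]
    Vb = Vt.T[:, 1:1 + n_vectors]

    # Project rows/columns onto the kept singular vectors and k-means each projection
    row_labels = KMeans(n_clusters=n_row_clusters, n_init=10).fit_predict(A @ Vb)
    col_labels = KMeans(n_clusters=n_col_clusters, n_init=10).fit_predict(A.T @ Ub)
    return row_labels, col_labels
```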
  • Example: clustering microarray data
  • , the checkerboard structure may have a different number of row clusters than column clusters, whereas the Spectral Co-Clustering algorithm requires that they have the same number
  • The methods also differ in how many singular vectors they use and how they project and cluster the rows and columns. Spectral Co-Clustering simultaneously projects and clusters rows and columns, whereas Spectral Biclustering does each separately.
  • For instance, in Liposarcoma a certain subset of genes may be highly active, while in Chondrosarcoma that same subset of genes may show almost no activity.
  • we can enhance the checkerboard pattern by normalizing the matrix to $A_n = R^{-\frac{1}{2}} A C^{-\frac{1}{2}}$, exactly as in Dhillon’s Spectral Co-Clustering algorithm
  • $A_1 = R_0^{-\frac{1}{2}} A C_0^{-\frac{1}{2}}$
  • This section introduces the checkerboard bicluster structure that the algorithm fits. The next section describes the algorithm in detail. Finally, in the last section we will see how it can be used for clustering real microarray data.
  • These equations look like the coupled eigenvalue problem:
  • Spectral Biclustering
  • It turns out that we can perform this scaling using the diagonal matrix $R^{-\frac{1}{2}}$
  • Spectral Biclustering algorithm
  • (Kluger, et al., 2003) [1]
  • So if the matrix has a checkerboard structure, a pair of singular vectors will give us the appropriate row and column classification vectors.
  • Kluger, et al., introduced another normalization method, which they called
  • Kluger, et al., 2003
  • The Spectral Biclustering algorithm was created to find these checkerboard patterns, if they exist.
  • where eigenvectors $r$ and $c$ have the same eigenvalue $\lambda^2$.
  • $A_{t+1} = R_t^{-\frac{1}{2}} A_t C_t^{-\frac{1}{2}}$
  • The data collected from a gene expression microarray experiment may be arranged in a matrix $A$, where the rows represent genes and the columns represent individual microarrays. Entry $A_{ij}$ measures the amount of RNA produced by gene $i$ that was measured by microarray $j$. If each microarray was used to measure tumor tissue, then each column of $A$ represents the gene expression profile of that tumor.
  • perfect checkerboard matrix $A$.
  • For each kind of tumor, we expect subsets of genes to behave differently
  • Assuming these patterns of differing expression levels are consistent, then the data would exhibit a checkerboard pattern. Each block represents a subset of genes that is similarly expressed in a subset of tumors.
  • Similarly, we can scale the columns using $C^{-\frac{1}{2}}$.
  • The algorithm
  • it is useful to normalize the matrix to make the checkerboard pattern more obvious.
  • bistochastization.
  • To demonstrate how normalization can be useful, here is a visualization of a perfect checkerboard matrix in which each row and column has been multiplied by some random scaling factor.
  • Matrices with this property are called bistochastic. In this method, the matrix is repeatedly normalized until convergence. The first step is the same:
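A minimal sketch of that iteration, assuming the update $A_{t+1} = R_t^{-\frac{1}{2}} A_t C_t^{-\frac{1}{2}}$ highlighted nearby; the tolerance and iteration cap are arbitrary choices, not from the article.

```python
import numpy as np

def bistochastize_sketch(A, tol=1e-8, max_iter=1000):
    # Repeat A_{t+1} = R_t^{-1/2} A_t C_t^{-1/2} until the matrix stops changing
    At = A.astype(float)
    for _ in range(max_iter):
        R_inv_sqrt = 1.0 / np.sqrt(At.sum(axis=1, keepdims=True))
        C_inv_sqrt = 1.0 / np.sqrt(At.sum(axis=0, keepdims=True))
        A_next = R_inv_sqrt * At * C_inv_sqrt
        if np.max(np.abs(A_next - At)) < tol:
            return A_next
        At = A_next
    return At
```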
  • Bistochastization makes the pattern even more obvious.
  • repeats the normalization of the matrix
  • Finally, Kluger, et al., also introduced a third method, log normalization. First, the log of the data matrix is taken
39 annotations
  • Spectral Co-Clustering algorithm
  • , Spectral Biclustering
  • (Dhillon, 2001) [1]
  • [1
  • [2
  • To learn which words to use, he collects the lyrics to many popular songs into a $w \times d$ term frequency matrix $A$, where $A_{ij}$ is the number of times word $i$ appears in song $j$.
  • Bob wants to jointly cluster the rows and columns of $A$ to find subsets of words that are used more frequently in subsets of songs.
  • bipartite
  • This problem can be converted into a graph partitioning problem: create a graph with $w$ word vertices, $d$ song vertices, and $wd$ edges, each between a song and a word. The edge between word $i$ and song $j$ has weight $A_{ij}$. This graph is bipartite because there are two disjoint sets of vertices (songs and words) with no edges within sets, and every song is connected to every word
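One way to picture that construction: the full bipartite graph has a $(w+d) \times (w+d)$ weighted adjacency matrix with $A$ in the off-diagonal blocks. A small numpy sketch (the helper name is made up):

```python
import numpy as np

def bipartite_adjacency(A):
    # A is the w x d word/song matrix; words connect only to songs and vice versa
    w, d = A.shape
    return np.block([
        [np.zeros((w, w)), A],
        [A.T, np.zeros((d, d))],
    ])
```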
  • Furthermore, he would like each partition to be about the same size
  • To find biclusters of words and songs, Bob wants to partition the graph so that edges within partitions have heavy weights and edges between partitions have light weights
  • Here is the partitioning of the example graph that achieves Bob’s goal:
  • , the goal of finding the optimal normalized cut leads us to the spectral clustering family of algorithms
  • Spectral clustering
  • We would like the samples within each ring to cluster together.
  • Spectral clustering will allow us to convert this data to a new space in which k-means gives better results.
  • To find this new space, we start by building a graph $G = \{ V, E \}$ with a vertex for each sample. Each pair of samples $x_i, x_j$ is connected by an edge with weight $s(x_i, x_j)$ equal to the similarity between them. The goal when building $G$ is to capture the local neighborhood relationships in the data. For simplicity, we achieve this goal by setting $s(x_i, x_j) = 0$ for all but the nearest neighbors.
  • $L = D - W$
  • $D$
  • : the smallest eigenvalue of $L$ is 0, with eigenvector $\mathbb{1}$
  • Build graph $G$ and compute its Laplacian $L$. Compute the first $k$ eigenvectors of $L$.
  • To do so, we introduce the notion of a graph cut. The following definition will be needed: if $V_i$ and $V_j$ are sets of vertices, $W(V_i) = \sum_{a, b \in V_i, a \neq b} W_{ab}$ is the sum of edge weights within a partition and $W(V_i, V_j) = \sum_{a \in V_i, b \in V_j} W_{ab}$
  • Unfortunately, the normalized cut problem is NP-hard.
  • together
  • We would like the samples within each ring to cluster together. However, algorithms like k-means will not work, because samples in different rings are closer to each other than samples in the same ring. It might lead, for example, to this result:
  • Let square matrix $W$ be the weighted adjacency matrix for $G$, so that $W_{ij} = s(x_i, x_j)$ when $i \neq j$ and $W_{ii} = 0$
  • $D$ be the degree matrix for $G$,
  • Since there are no edges between components, $W$ is block diagonal, and therefore $L$ is block diagonal. Each block of $L$ is the Laplacian for a connected component. In this case, since the two submatrices $L_1$ and $L_2$ are also Laplacians, they each have eigenvalues of 0 and eigenvectors of $\mathbb{1}$. Therefore, we know that the eigenvalue 0 of $L$ has multiplicity two, and its eigenspace is spanned by $[0, \dots, 0, 1, \dots, 1]^\top$ and $[1, \dots, 1, 0, \dots, 0]^\top$, where the number of $1$ entries in each vector is equal to the number of vertices in that connected component.
  • This realization suggests a strategy for finding $k$ clusters if they appear in the graph as connected components:
  • , consider the ring problem again. In this particular case, we were lucky when building graph $G$, because it contains two connected components, each corresponding to one of the clusters we would like to find.
  • The Laplacian makes it easy to recover those clusters
  • We want to partition the graph to minimize the cut, but the minimum is trivially achieved by setting one $V_i = V$ and $V_j = \emptyset$ for $i \neq j$. To ensure that each partition is approximately balanced, we instead minimize the normalized cut, which normalizes by the weight of each partition:
  • normalized cut,
  • $A_n = U \Sigma V^\top$ will give us the desired partitions of the rows and columns of $A$. A subset of the left singular vectors will give the word partitions, and a subset of the right singular vectors will give the song partitions
  • Co-clustering documents and words using bipartite spectral graph partitioning
  • $A_n = R^{-1/2} A C^{-1/2}$, where $R$ is the diagonal matrix with entry $i$ equal to $\sum_{j} A_{ij}$ and $C$ is the diagonal matrix with entry $j$ equal to $\sum_{i} A_{ij}$.
  • The steps for spectral clustering, minimizing $\text{ncut}$, are: Build graph $G$ and compute its Laplacian $L$. Compute the first $k$ eigenvectors of $Lu = \lambda D u$. Treat the eigenvectors as a new data set with $n$ samples and $k$ features. Use k-means to cluster it.
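A sketch of those steps in Python, using a symmetrized k-nearest-neighbor connectivity graph as the similarity $s(x_i, x_j)$ (an assumption; any local similarity would do) and SciPy's generalized eigensolver. `sklearn.cluster.SpectralClustering` is the maintained equivalent.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def spectral_clustering_sketch(X, k=2, n_neighbors=10):
    # Build graph G: symmetrized k-nearest-neighbor connectivity as the affinity
    W = kneighbors_graph(X, n_neighbors=n_neighbors, mode='connectivity').toarray()
    W = np.maximum(W, W.T)
    np.fill_diagonal(W, 0.0)

    # Degree matrix D and Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W

    # First k eigenvectors of the generalized problem L u = lambda D u
    _, U = eigh(L, D, subset_by_index=[0, k - 1])

    # Treat the eigenvectors as a new n x k data set and cluster it with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```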
  • . Remember that Bob had converted his word frequency matrix into a bipartite graph $G$. Remember also that he wants to partition it into $k$ partitions by finding the optimal normalized cut. We saw in the previous section how to find a solution to this problem, which involved finding the eigenvectors of the matrix $L$. We could do that directly here, too, but we can avoid working on the $(w+d) \times (w+d)$ matrix $L$ and instead work with the smaller $w \times d$ matrix $A$
  • $A_n = R^{-1/2} A C^{-1/2}$
  • Here is the full algorithm for finding $k$ clusters, adapted from the original paper: Given $A$, normalize it to $A_n = R^{-\frac{1}{2}} A C^{-\frac{1}{2}}$. Compute $\ell = \lceil \log_2 k \rceil$ singular vectors of $A_n$, $u_2 \dots u_{\ell+1}$ and $v_2 \dots v_{\ell+1}$. Form the matrix $Z$ as just shown. Cluster with k-means the $\ell$-dimensional data $Z$ to obtain the desired $k$ biclusters. Notice that the $w$ rows and $d$ columns of the original data matrix are converted to the $w+d$ rows in matrix $Z$. Therefore, both rows and columns are treated as samples and clustered together.
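A hedged numpy sketch of that recipe. The stacking $Z = [R^{-1/2} U_b; C^{-1/2} V_b]$ is the one used in Dhillon's paper (the "as just shown" step not repeated in these highlights); everything else follows the bullet above. `sklearn.cluster.SpectralCoclustering` is a tested implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_coclustering_sketch(A, k):
    # Normalize: A_n = R^{-1/2} A C^{-1/2}, with R and C the row/column sum diagonals
    r = A.sum(axis=1)
    c = A.sum(axis=0)
    An = A / np.sqrt(np.outer(r, c))

    # l = ceil(log2 k) singular vectors, skipping u_1 and v_1
    l = int(np.ceil(np.log2(k)))
    U, _, Vt = np.linalg.svd(An)
    Ub, Vb = U[:, 1:l + 1], Vt.T[:, 1:l + 1]

    # Stack rescaled row and column representations into Z and cluster them together
    Z = np.vstack([Ub / np.sqrt(r)[:, None], Vb / np.sqrt(c)[:, None]])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)

    w = A.shape[0]
    return labels[:w], labels[w:]   # word (row) labels, song (column) labels
```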
  • a new algorithm for finding checkerboard patterns, the Spectral Biclustering algorithm (Kluger, 2003).
  • $\text{cut}(V_1 \dots V_n)$
  • Unfortunately, most of the time $G$ will not be so conveniently separated into connected components. For instance, if we add a bridge between the rings, the resulting graph has only one connected component, even with the nearest-neighbor affinities:
  • This realization suggests a strategy for finding $k$ clusters if they appear in the graph as connected components: Build graph $G$ and compute its Laplacian $L$. Compute the first $k$ eigenvectors of $L$. The nonzero entries of eigenvector $u_i$ indicate the samples belonging to cluster $i$.
  • is the sum of edge weights between the two partitions
  • $V = V_1 \cup V_2 \cup \dots \cup V_k$,
  • it turns out that the first $k$ eigenvectors of the generalized eigenproblem $Lu = \lambda D u$ provide the solution to the relaxed problem.
  • normalization
  • Spectral Co-Clustering
  • A tutorial on spectral clustering
  • normalize $A$ to $A_n$
  • [3]
56 annotations
  • [1, 2, 3
  • . For a more detailed overview of clustering, there are several surveys available
  • As the difference in sizes between the largest and the smallest bicluster grows, $b$ decays as
  • $n \times p$
  • where $r_{ki}$ is an indicator variable for membership of guest $i$ in cluster $k$, $c_{kj}$ is an indicator variable for album membership, and $b \in [0, 1]$ penalizes unbalanced solutions, i.e., those with biclusters of different sizes
  • entire dataset
  • This view of clustering as partitioning the rows of the data matrix can be generalized to biclustering, which can be viewed as partitioning both the rows and the columns simultaneously
  • Our host has invited fifty guests, and he owns thirty albums. He sends out a survey to each guest asking if they like or dislike each album. After receiving their responses, he collects the data into a $50 \times 30$ binary matrix $\boldsymbol M$, where $M_{ij} = 1$ if guest $i$ likes album $j$.
  • where
  • $s(M, r, c) = b(r, c) \cdot \sum_{i,j,k} M_{ij} r_{ki} c_{kj}$
  • Had Bob wanted the best solution, the naive approach would require trying every possible clustering, resulting in $k^{n+p} = 3^{80}$ candidate solutions. This suggests that Bob’s problem is in a nonpolynomial complexity class. In fact, most formulations of biclustering problems are NP-complete
  • [5].
  • 4, 5, 6]
  • Bob is planning a housewarming party for his new three-room house. Each room has a separate sound system, so he wants to play different music in each room. As a conscientious host, Bob wants everyone to enjoy the music. Therefore, he needs to distribute albums and guests to each room in order to ensure that each guest hears their favorite songs.
  • objective function:
  • is the set of bicluster sizes, and $\epsilon > 0$ is a parameter that sets the aggressiveness of the penalty.
  • Bob uses the following algorithm to find his solution: starting with a random assignment of rows and columns to clusters, he reassigns rows and columns to improve the objective function until convergence
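A sketch of that local search, under stated assumptions: the exact form of the balance penalty $b(r, c)$ is truncated in these highlights, so an exponential decay in the gap between the largest and smallest bicluster (controlled by $\epsilon$) stands in for it, and the helper name is made up.

```python
import numpy as np

def party_biclustering_sketch(M, k=3, epsilon=0.1, seed=None):
    rng = np.random.default_rng(seed)
    n, p = M.shape
    rows = rng.integers(k, size=n)   # guest -> room assignment
    cols = rng.integers(k, size=p)   # album -> room assignment

    def score(rows, cols):
        # sum_{i,j,k} M_ij r_ki c_kj: "likes" that fall inside a bicluster
        inside = sum(M[np.ix_(rows == room, cols == room)].sum() for room in range(k))
        sizes = [np.sum(rows == room) + np.sum(cols == room) for room in range(k)]
        b = np.exp(-epsilon * (max(sizes) - min(sizes)))  # assumed penalty form
        return b * inside

    best, improved = score(rows, cols), True
    while improved:                  # greedy reassignment until no single move helps
        improved = False
        for assignment, count in ((rows, n), (cols, p)):
            for idx in range(count):
                keep = assignment[idx]
                for room in range(k):
                    assignment[idx] = room
                    s = score(rows, cols)
                    if s > best:
                        best, keep, improved = s, room, True
                assignment[idx] = keep
    return rows, cols, best
```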
  • [1] Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264-323. [2] Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data (pp. 25-71). Springer Berlin Heidelberg. [3] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
  • [5] Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: a survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1), 24-45.
  • Any data that can be represented as a matrix is amenable to biclustering
  • It is a popular technique for analyzing gene expression data from microarray experiments
  • It is a popular technique for analyzing gene expression data from microarray experiments. It has also been applied to recommendation systems, market research, databases, financial data, and agricultural data, as well as many other problems.
  • Those interested in a more detailed overview of the field may be interested in the surveys
25 annotations
 computer science
  • The goal is to be able to detect both kinds of anomalies. Clearly, static thresholds can only detect global anomalies when there’s seasonality or trend
  • ARIMA
  • stationary
  • Trends break models because the value of a time series with a trend isn’t stable, or stationary
  • What’s the mean of a metric that has a trend?
  • the mean is actually a function with time as a parameter.
  • moving average
  • WMA
  • How do you deal with trend? First, it’s important to understand that metrics with trends can be considered as compositions of other metrics. One of the components is the trend, and so the solution to dealing with trend is simple: find a model that describes the trend, and subtract the trend from the metric’s values! After the trend is removed, you can use the models that we’ve previously mentioned on the remainder.
  • first difference
  • To remove a linear trend, you can simply use a first difference. This means you consider the differences between consecutive values of a time series rather than the raw values of the time series itself.
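For example, with numpy's `diff` (the synthetic slope and noise level below are made up):

```python
import numpy as np

t = np.arange(200)
y = 0.5 * t + np.random.normal(scale=2.0, size=t.size)  # metric with a linear trend

diff = np.diff(y)        # consecutive differences: y[1]-y[0], y[2]-y[1], ...
print(diff.mean())       # hovers near the slope (0.5) instead of growing with time
```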
  • Seasonal time series data has cycles
  • because seasonality is variable trend. Instead of increasing or decreasing at a fixed rate, a metric with seasonality increases or decreases with rates that vary with time
  • Figure 4-3.
  • There are annoying problems such as daylight saving time changes,
  • Multiple exponential smoothing was introduced to resolve problems with using an EWMA on metrics with trend and/or seasonality.
  • You have to know the period of the seasonality beforehand
  • Holt-Winters
  • Small changes in the parameters can create large changes in the predicted values
  • With a single EWMA, there is a single smoothing factor: α (alpha). Because there are two more EWMAs for trend and seasonality, they also have their own smoothing factors. Typically they’re denoted as β (beta) for trend and γ (gamma) for seasonality.
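A compact sketch of one common additive Holt-Winters form, with level, trend, and seasonal EWMAs smoothed by alpha, beta, and gamma; the crude initialization is an assumption here, not from the book (see the Forecasting: principles and practice reference below for proper derivations).

```python
def holt_winters_additive(y, m, alpha=0.5, beta=0.1, gamma=0.1):
    # m is the seasonal period, which must be known in advance
    level, trend = y[0], 0.0
    season = [y[i] - y[0] for i in range(m)]   # crude first-cycle seasonal estimates
    fitted = []
    for t, obs in enumerate(y):
        s = season[t % m]
        fitted.append(level + trend + s)        # one-step-ahead prediction
        last_level = level
        level = alpha * (obs - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[t % m] = gamma * (obs - level) + (1 - gamma) * s
    return fitted
```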
  • Forecasting: principles and practice for a detailed derivation
  • Holidays often aren’t in-sync with seasonality.
  • . This depends on the parameters you use. Too sensitive and you get false positives; too robust and you miss them
  • There might be unusual events
  • Fourier transform
  • Another issue is that their predictive power operates at large time scales.
  • Fortunately, there’s a whole area of time series analysis that focuses on this topic: spectral analysis, which is the study of frequencies and their relative intensities
  • it updates the model to fit the metric’s local behavior
  • You start with the same “next = current” formula, but now you also have to add in the trend and seasonal terms
  • An outage can throw off a model by making it predict an outage again in the next cycle, which results in a false alarm.
  • If you’re trying to predict things at higher resolutions, such as second by second, there’s so much mismatch between the time scales that they’re not very useful.
  • It’s sometimes difficult to determine the seasonality of a metric
  • spectral analysis
  • Using a Fourier transform, it’s possible to take a very complicated time series with potentially many seasonal components, break them down into individual frequency peaks, and then subtract them from the original time series, keeping only the signal you want.
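A rough illustration with numpy's FFT helpers, on a synthetic hourly series with a daily cycle (all the constants are made up): the dominant frequency peak is zeroed out and the rest of the signal is kept.

```python
import numpy as np

n, period = 24 * 14, 24                       # two weeks of hourly samples, daily cycle
t = np.arange(n)
y = 10 * np.sin(2 * np.pi * t / period) + np.random.normal(size=n)

spectrum = np.fft.rfft(y - y.mean())
peak = np.argmax(np.abs(spectrum))            # the dominant seasonal frequency peak
spectrum[peak] = 0                            # subtract that component
residual = np.fft.irfft(spectrum, n=n) + y.mean()   # series with the seasonality removed
```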
  • In our opinion, the useful things that can be done with a DFT, such as implementing low- or high-pass filters, can be done using much simpler methods. A low-pass filter can be implemented with a moving average, and a high-pass filter can be done with differencing
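The simpler equivalents mentioned here, as small helpers (names and the default window are illustrative):

```python
import numpy as np

def low_pass(y, window=24):
    # Low-pass filter: a trailing moving average over `window` points
    return np.convolve(y, np.ones(window) / window, mode='valid')

def high_pass(y):
    # High-pass filter: first differences keep only the fast-changing part
    return np.diff(y)
```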
  • Seasonality
  • are continuous increases or decreases in a metric’s value
  • . Trends
  • could be a spike during an idle period
  • local anomaly
  • global anomaly
  • reflects periodic (cyclical) patterns that occur in a system, usually rising above a baseline and then decreasing again. Common seasonal periods are hourly, daily, and weekly, but your systems may have a seasonal period that’s much longer or even some combination of different periods.
  • would be anomalously high (or low) no matter when it occurs
48 annotations