Let $(X,Y)$ be a random couple in $S\times T$ with unknown distribution $P$ and let $(X_1,Y_1),\dots,(X_n,Y_n)$ be i.i.d. copies of $(X,Y)$. Denote by $P_n$ the empirical distribution of $(X_1,Y_1),\dots,(X_n,Y_n)$. Let $h_1,\dots,h_N:S\mapsto [-1,1]$ be a dictionary consisting of $N$ functions. For $\lambda\in\mathbb{R}^N$, denote $f_{\lambda}:=\sum_{j=1}^N\lambda_j h_j$. Let $\ell:T\times\mathbb{R}\mapsto\mathbb{R}$ be a given loss function that is convex with respect to the second variable, and let $(\ell\bullet f)(x,y):=\ell(y;f(x))$. Finally, let $\Lambda\subset\mathbb{R}^N$ be the simplex of all probability distributions on $\{1,\dots,N\}$. Consider the following penalized empirical risk minimization problem \begin{eqnarray*}\hat\lambda^{\varepsilon}:=\mathop{\rm argmin}_{\lambda\in\Lambda}\Biggl[P_n(\ell\bullet f_{\lambda})+\varepsilon\sum_{j=1}^N\lambda_j\log\lambda_j\Biggr]\end{eqnarray*} along with its distribution dependent version \begin{eqnarray*}\lambda^{\varepsilon}:=\mathop{\rm argmin}_{\lambda\in\Lambda}\Biggl[P(\ell\bullet f_{\lambda})+\varepsilon\sum_{j=1}^N\lambda_j\log\lambda_j\Biggr],\end{eqnarray*} where $\varepsilon\geq 0$ is a regularization parameter. It is proved that the “approximate sparsity” of the distribution dependent solution $\lambda^{\varepsilon}$ implies the “approximate sparsity” of the empirical solution $\hat\lambda^{\varepsilon}$, and the impact of “sparsity” on bounding the excess risk of the empirical solution is explored. Similar results are also discussed in the case of entropy penalized density estimation.
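
As an illustration only, the penalized empirical risk minimization problem over the simplex $\Lambda$ can be approximated numerically. The sketch below is not taken from the source, which does not specify an algorithm: it uses a generic entropic mirror descent (exponentiated gradient) iteration and assumes the squared loss $\ell(y;u)=(y-u)^2$. The function names `objective` and `entropy_penalized_erm`, the matrix `H` with entries $h_j(X_i)$, and the step size are all hypothetical choices made for this example.

```python
import numpy as np


def objective(lam, H, y, eps):
    """Empirical risk plus eps * sum_j lam_j log lam_j (squared loss assumed)."""
    resid = H @ lam - y
    # Convention 0 * log 0 = 0 on the boundary of the simplex.
    safe = np.where(lam > 0, lam, 1.0)
    return np.mean(resid ** 2) + eps * np.sum(lam * np.log(safe))


def entropy_penalized_erm(H, y, eps, step=0.25, n_iter=2000):
    """Entropic mirror descent over the simplex (illustrative sketch).

    H is an (n, N) array with H[i, j] = h_j(X_i) in [-1, 1]; y has length n.
    Each iteration minimizes, over the simplex, the linearized loss plus the
    entropy penalty plus (1 / step) * KL(lam || lam_t), which has the closed
    form used below. Iterates are kept in the log domain for stability.
    """
    n, N = H.shape
    log_lam = np.full(N, -np.log(N))  # start at the uniform distribution
    for _ in range(n_iter):
        lam = np.exp(log_lam)
        grad = 2.0 * H.T @ (H @ lam - y) / n  # gradient of the squared loss
        log_lam = (log_lam - step * grad) / (1.0 + step * eps)
        log_lam -= np.logaddexp.reduce(log_lam)  # renormalize to sum to one
    return np.exp(log_lam)
```

With a synthetic target that puts all its weight on a few dictionary elements and a small $\varepsilon$, the returned weights should concentrate most of their mass on those elements. For large $\varepsilon$ the penalty pulls the solution toward the uniform distribution.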