An ex-colleague of ours suggested splitting the range into [0, K) and [K, +inf) for some K (e.g., the 95th percentile of the data) and training a binary model to predict which range a value falls into. The lower range would then use plain MSE on the raw value, while the higher range would model the distribution: e.g., predict both the mean (with MSE) and the variance of the log-value, assuming the value is log-normally distributed. At inference, the latter model can give an unbiased estimate of the mean of the value by combining the predicted mean and variance in log-scale.
I'm not sure about splitting the range (and it's a separate idea), but modelling the distribution instead of a point estimate does make sense.
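To make the inference step concrete: if log(value) is normal with mean mu and variance sigma^2, then the mean of the value is exp(mu + sigma^2 / 2), not exp(mu) (which is the median). Below is a minimal sketch of that correction, checked against simulated data. The function name `lognormal_mean` and the numbers for `mu_log` and `var_log` are made up for illustration. In practice they would come from the model's two outputs, and the estimate is only as unbiased as those predictions and the log-normal assumption are accurate.

```python
import numpy as np


def lognormal_mean(mu_log, var_log):
    """Mean of X when log(X) ~ Normal(mu_log, var_log)."""
    return np.exp(np.asarray(mu_log) + 0.5 * np.asarray(var_log))


# Hypothetical predictions in log-scale (stand-ins for model outputs).
mu_log, var_log = 2.0, 1.5

# Simulate values that really are log-normal with those parameters.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=mu_log, sigma=np.sqrt(var_log), size=1_000_000)

naive = np.exp(mu_log)                        # exponentiating the log-mean gives the median
corrected = lognormal_mean(mu_log, var_log)   # adds the sigma^2 / 2 term
empirical = samples.mean()

print(f"naive exp(mu):        {naive:.3f}")
print(f"corrected estimate:   {corrected:.3f}")
print(f"empirical sample mean: {empirical:.3f}")
```

The naive back-transform underestimates the mean by a factor of exp(sigma^2 / 2), so the gap grows quickly with the predicted variance, which is exactly the heavy-tailed regime this idea targets.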