Posts

Understanding Attribute Types

Know your Data

What is data? It is a set of records, each having one or more attributes or features. It is also called a collection of data objects.

Example of Data (Fig1):

Name  | Income | Profession | Mother Tongue | Native Place
Ram   | 70000  | Doctor     | Bengali       | Village
Shyam | 50000  | Carpenter  | Hindi         | Small Town
Mohan | 60000  | Engineer   | Hindi         | Suburban
Kabir | 90000  | Doctor     | Bengali       | Metropolitan

As shown in Fig1, there are four rows and five columns.

Record: Each row is called a record. There are four records in the example dataset.

Feature or Attribute: Each column is called a feature or attribute of a record in the dataset. There are five attributes for each record in the example dataset.

Understanding of Data: Knowing your data is very important in data science, whether you are doing Data Mining or applying Machine Learning techniques to it. To understand the data, you have to understand each attribute of a record. The following properties of numbers are typically used to describe attributes: Distinctne...
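As a rough illustration of records and attributes, the Fig1 table can be loaded into a tabular structure. The sketch below uses pandas (an assumption; the post itself shows no code):

```python
import pandas as pd

# Each dict is one record; each key is an attribute (column) of that record.
data = pd.DataFrame([
    {"Name": "Ram",   "Income": 70000, "Profession": "Doctor",    "Mother Tongue": "Bengali", "Native Place": "Village"},
    {"Name": "Shyam", "Income": 50000, "Profession": "Carpenter", "Mother Tongue": "Hindi",   "Native Place": "Small Town"},
    {"Name": "Mohan", "Income": 60000, "Profession": "Engineer",  "Mother Tongue": "Hindi",   "Native Place": "Suburban"},
    {"Name": "Kabir", "Income": 90000, "Profession": "Doctor",    "Mother Tongue": "Bengali", "Native Place": "Metropolitan"},
])

print(data.shape)   # (4, 5): four records, five attributes
print(data.dtypes)  # Income is numeric; the other attributes are nominal
```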

Density-Based Clustering Evaluation with a Modified Silhouette Score

Modified Silhouette Score for HDBSCAN: The Silhouette Score works well with convex clusters. However, when dealing with a density-based clustering algorithm such as HDBSCAN, the traditional Silhouette Score doesn't perform well due to the irregular cluster shapes and the presence of noise points. To address this, a modified version of the Silhouette Score can be applied to HDBSCAN clusters, where the calculation is based on core points and the distance metric used reflects the density-based clustering.

Core Points: HDBSCAN categorizes points as core (high-density points), border (points near the cluster boundary), or noise. The modified Silhouette Score focuses on core points, as they represent the true structure of the cluster.

Density-based Distance Metric: Traditional distance metrics (like Euclidean distance) assume spherical clusters. For HDBSCAN, a density-aware distance metric is more appropriate, such as the distance between core points based on density connectivity or mutual re...
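A minimal sketch of this idea, assuming the hdbscan package and scikit-learn: cluster the data, drop noise points (label -1), and compute the Silhouette Score only on high-confidence points. Using the membership probabilities with a 0.9 cutoff as a stand-in for "core points" is an illustrative choice, not part of the post:

```python
import hdbscan
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)

# Keep only clustered points (label != -1) whose membership probability is
# high, as a rough proxy for HDBSCAN's core points.
mask = (labels != -1) & (clusterer.probabilities_ > 0.9)

print(silhouette_score(X[mask], labels[mask]))
```

Note that the plain Euclidean silhouette is still used here; swapping in a density-aware metric, as the post suggests, would require passing a precomputed distance matrix instead.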

Density-Based Clustering Evaluation with the DCV Algorithm

Density-Based Cluster Validity (DCV): It evaluates clusters produced by density-based clustering algorithms such as DBSCAN and HDBSCAN. The fundamental idea of DCV is to compare the density inside clusters with the density of the overall dataset.

Cluster Density: For each cluster Ci, the intra-cluster density D(Ci) is computed by averaging the pairwise distances between all points within the cluster. This gives a measure of how tightly packed the points in the cluster are.

Overall Dataset Density: The overall dataset density D(X) is computed from the pairwise distances between all points in the dataset, both within and outside of clusters. This acts as a baseline density for comparison.

Validity Measure: The DCV index V is computed by comparing the intra-cluster densities to the overall dataset density, where:
|Ci| is the number of points in cluster Ci.
D(x) is the density for each point in cluster Ci.
D(X) is the overall dataset densi...
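The exact DCV formula is cut off in this excerpt, so the sketch below encodes just one plausible reading of the description: each cluster's mean pairwise distance is compared against the dataset-wide mean pairwise distance, weighted by cluster size. Treat it as illustrative, not as the canonical DCV index:

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(points):
    # Average pairwise Euclidean distance; smaller means denser.
    return pdist(points).mean() if len(points) > 1 else 0.0

def dcv_index(X, labels):
    d_overall = mean_pairwise_distance(X)  # baseline density D(X)
    weighted_sum, n_clustered = 0.0, 0
    for c in np.unique(labels):
        if c == -1:  # skip noise points produced by DBSCAN/HDBSCAN
            continue
        pts = X[labels == c]
        # Size-weighted ratio of intra-cluster density D(Ci) to D(X).
        weighted_sum += len(pts) * mean_pairwise_distance(pts) / d_overall
        n_clustered += len(pts)
    # Values well below 1 mean clusters are much denser than the dataset overall.
    return weighted_sum / n_clustered
```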

Cluster Evaluation: Silhouette Score

Technical Concept: The Silhouette Score evaluates cluster quality based on both cohesion and separation. That means it evaluates the quality of clustering by measuring how well data points fit within their assigned clusters and how well separated the clusters are. It provides insight into the separation and compactness of clusters, combining information about intra-cluster similarity and inter-cluster dissimilarity.

Formula: For a given data point i:

a(i) -> the average distance between point i and all other points in the same cluster.
b(i) -> the average distance from point i to all the points in each cluster not containing i; take the minimum such value over all those clusters.

The Silhouette Score s(i) is computed as:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

Silhouette coef...
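A minimal sketch of computing the score with scikit-learn (the KMeans clustering and synthetic blobs are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print(silhouette_score(X, labels))        # mean s(i) over all points
print(silhouette_samples(X, labels)[:5])  # per-point s(i) values
```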

Technical Concept and Interview Question: Overfitting

Technical Concept: Overfitting occurs when a machine learning model learns not only the underlying patterns in the training data but also the noise and details specific to the training set. As a result, the model becomes too complex and fits the training data too well, but performs poorly on unseen or test data because it fails to generalize.

Key Characteristics of Overfitting:

High Training Accuracy, Low Test Accuracy: The model achieves high accuracy on the training data but performs poorly on the test data.

Complex Models: Overfitting typically occurs in models that are too complex (e.g., with too many parameters) for the amount of training data.

Memorization: Instead of learning the general patterns in the data, the model "memorizes" the training data, including the noise or random fluctuations that do not represent the true relationship between input features and target labels.

Example of Overfitting: Suppose you're building a model to predict house prices based ...
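To make the train-vs-test gap concrete, here is a small sketch on synthetic data (an assumption; the post's house-price example is only described in prose). The degree-15 polynomial tends to score best on the training set and worst on the test set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # A high train R^2 paired with a much lower test R^2 signals overfitting.
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```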

Technical Concept and Interview Question: Cross Validation

Technical Concept: Cross-Validation in Machine Learning is a statistical resampling technique that uses different parts of the dataset to train and test a machine learning algorithm on different iterations.

Key Concepts of Cross-Validation:

Train-Test Split Problem: In the traditional train-test split, the data is divided into two parts: training and testing sets. However, the performance measured on the test set might vary depending on how the data was split. This can lead to unreliable estimates of model performance.

Purpose of Cross-Validation: Cross-validation helps overcome the variability in performance estimates by using multiple splits of the data, providing a more robust and reliable evaluation of the model's performance.

Types of Cross-Validation:

k-Fold Cross-Validation: The data is split into k equally sized subsets (or "folds"). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold being used ...
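A minimal k-fold sketch with scikit-learn (the iris data and logistic regression model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds: each fold serves once as the test set and four times for training.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(scores)         # one accuracy per fold
print(scores.mean())  # more stable estimate than a single train-test split
```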

Technical Concept and Interview Question: Loss Function

Technical Concept: Loss Function: A method to evaluate how well your algorithm models your dataset. If your predictions are totally off, your loss function will output a higher number.

Explanation: A loss function, also known as a cost function, is a critical component in training machine learning models. It quantifies the difference between the values predicted by the model and the actual values in the dataset. This function provides a measure of how well the model is performing; the lower the loss, the better the model's predictions align with the true data. During the training process, the goal is to minimize this loss through various optimization techniques, such as gradient descent.

Types of Loss Functions: Different types of machine learning problems (e.g., regression, classification) require different loss functions.

1. Regression Loss Functions: Used for problems where the target is a continuous value (e.g., predicting house prices). Mean Squared Error (MSE): MSE = 1 ...
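The MSE formula is cut off in this excerpt; assuming the standard definition, MSE = (1/n) * sum((y_i - y_hat_i)^2), a minimal sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared residuals.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5])
print(mse(y_true, np.array([2.8, 5.1, 2.6])))  # close predictions -> small loss
print(mse(y_true, np.array([0.0, 0.0, 0.0])))  # far-off predictions -> large loss
```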