🤦‍♂️ Poor data cleansing & high costs sink 98% of machine learning projects
🤖 Supply Chain Attacks on NPM, Intel Robot’s 3888.9% Performance Boost & More
Welcome to HackerPulse Dispatch, your weekly newsletter that curates the most valuable and relevant developments in the tech world.
We provide a succinct summary of the latest breakthroughs in the tech industry, from NPM malware attacks to the mystery of a program crashing on its first instruction, helping you stay informed and engaged!
Here’s what’s new:
💥 The Case of a Program That Crashed on Its First Instruction
Let’s dive into the case of a program crash on its very first instruction, where a tangled web of threads, suspicious memory writes, and an infinite loop reveal clues of potential code injection.
🚩 98% of Companies Experienced ML Project Failures Last Year, With Poor Data Cleansing and Lackluster Cost-Performance the Primary Causes
Explore how companies are grappling with “bill shock” in data analytics and ML, facing costly trade-offs that impact project volume, query complexity, and the need for affordable solutions to drive better outcomes.
☣️ Hundreds of Code Libraries Posted to NPM Try to Install Malware on Dev Machines
Learn about new supply chain attacks on the NPM repository and stay vigilant by verifying package names before installation.
⚡ Intel Spots a 3888.9% Performance Improvement in the Linux Kernel From One Line of Code
Intel’s Linux kernel test robot has identified a 3888.9% performance increase due to a single-line patch optimizing THP alignment, which corrects previous memory fragmentation issues and boosts specialized workloads.
🔥 How Google Ads Was Able to Support 4.77 Billion Users With a SQL Database
Discover how Google’s revolutionary Spanner database transformed data management, powering its multi-billion-dollar ad empire.
The Case of a Program That Crashed on Its First Instruction (🔗 Read the Story)
It was one of those debugging mysteries: a customer’s program seemed to crash right from its first instruction, and the crash report was baffling. The debugger didn’t even know what went wrong, spitting out cryptic errors and warnings about untraceable threads and inaccessible commands.
The investigation led to thread 1, which was attempting to write to a read-only memory region – a suspicious action that hinted at an access violation. This odd behavior continued in an infinite loop, suggesting something potentially malicious or deeply flawed in the code execution.
Key Points
Initial mystery: The crash dump showed errors on the first instruction, with the debugger failing to identify the problem, citing inaccessible threads and exceptions.
Suspicious write: Thread 1 was writing to a read-only memory region, potentially attempting to alter the program’s image header, leading to an access violation.
Infinite loop reveal: Further analysis uncovered that one thread was stuck in an infinite loop, repeatedly calling a mystery function – with another thread merely waiting in Sleep mode, possibly indicating a flaw or malicious behavior.
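The suspicious write above is the heart of the mystery: a thread scribbling over a read-only image region triggers an access violation. As a minimal sketch (not the actual program from the story), the snippet below creates a read-only anonymous memory mapping on Linux with Python’s `mmap` module and shows that a write attempt is rejected before it can corrupt anything:

```python
import mmap

# Anonymous read-only mapping, loosely analogous to the read-only
# image-header region the faulting thread tried to overwrite.
ro_region = mmap.mmap(-1, 4096, access=mmap.ACCESS_READ)

try:
    ro_region[0] = 0x4D  # attempt to overwrite the first byte
    write_rejected = False
except TypeError as exc:  # mmap refuses writes to a read-only map
    write_rejected = True
    print(f"write rejected: {exc}")
finally:
    ro_region.close()
```

In native code there is no such safety net: the same write lands as a hardware page fault, which the OS surfaces as exactly the kind of access-violation crash the debugger reported.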
98% of Companies Experienced ML Project Failures Last Year, With Poor Data Cleansing & Lackluster Cost-Performance the Primary Causes (🔗 Read the Story)
Companies are feeling the squeeze of escalating costs in data analytics and machine learning (ML) projects, with many experiencing frequent “bill shock” due to unforeseen cloud expenses.
The impact of these costs is leading to tough trade-offs, with organizations often forced to limit their query complexity or project volume to stay within budget. A recent survey reveals that nearly all companies face challenges from both high analytics costs and frequent ML project failures, highlighting a widespread need for more efficient resource management.
Additionally, many organizations are turning to hardware solutions, with a growing emphasis on GPU instances, to handle complex, compute-heavy analytics tasks.
Key Points
Analytics “bill shock”: 71% of companies are surprised by their cloud analytics costs quarterly or more often, with some experiencing unexpected expenses every month.
High costs of ML experimentation: 41% of organizations cite the high costs of ML experimentation as their biggest hurdle, suggesting that affordable, high-speed platforms could improve ROI.
Compromises on query complexity: To control costs, nearly half of companies are reducing the complexity of their queries and the volume of projects, affecting the quality and timeliness of their data-driven insights.
Hundreds of Code Libraries Posted to NPM Try to Install Malware on Dev Machines (🔗 Read the Story)
A recent campaign has emerged targeting the Node Package Manager (NPM) repository, flooding it with malicious packages disguised as legitimate code libraries like Puppeteer and Bignum.js.
This attack, discovered by security firm Phylum, underscores the ongoing threat of supply chain attacks aimed at developers. By leveraging Ethereum’s blockchain, the attackers use a unique technique to hide their IP addresses, yet ironically leave a digital trail for researchers to track their activity.
As supply chain attacks remain a prominent threat, developers should stay vigilant and verify package names before installation.
Key Points
Novel concealment tactic: Attackers used Ethereum’s smart contracts to store IP addresses, enabling hidden but traceable paths for the malware’s secondary payloads.
Typosquatting traps: Malicious packages mimic legitimate libraries with subtle misspellings, hoping to exploit developers who might overlook small typos.
Protective measures: Phylum has shared detailed indicators, including IPs and hashes, for developers to cross-reference against potentially harmful packages.
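One practical way to act on the “verify package names” advice is to compare each dependency you are about to install against the names you actually depend on and flag near-misses. The sketch below is illustrative, not a Phylum tool; the allowlist and threshold are assumptions you would tune for your own project:

```python
from difflib import SequenceMatcher

# Hypothetical allowlist: the packages this project actually depends on.
KNOWN_GOOD = {"puppeteer", "bignum", "web3"}

def flag_typosquat(candidate, threshold=0.85):
    """Return the known package `candidate` suspiciously resembles, or None."""
    name = candidate.lower()
    if name in KNOWN_GOOD:
        return None  # exact match: nothing to flag
    for good in KNOWN_GOOD:
        # High similarity without an exact match is the typosquatting pattern.
        if SequenceMatcher(None, name, good).ratio() >= threshold:
            return good
    return None

print(flag_typosquat("puppetter"))  # near-miss of "puppeteer" -> flagged
print(flag_typosquat("puppeteer"))  # exact match -> None
```

A check like this catches the subtle misspellings described above, though it is no substitute for cross-referencing published indicators of compromise.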
Intel Spots a 3888.9% Performance Improvement in the Linux Kernel From One Line of Code (🔗 Read the Story)
Intel’s Linux kernel test robot has flagged an extraordinary 3888.9% performance boost in the mainline Linux kernel this past week, thanks to a patch that limits transparent huge page (THP) alignment for specific anonymous mappings.
The reported uplift is based on Intel’s “will-it-scale” scalability test on an Intel Xeon Platinum server, showcasing the potential for THP optimizations to correct past performance regressions while driving notable gains.
Key Points
Background on the patch: The massive improvement stems from a single-line patch that adjusts THP alignment conditions, correcting previous performance setbacks and optimizing memory access for specialized cases.
Impact on workloads: The update specifically aids in reducing fragmentation in memory mappings, resolving slowdowns of up to 600% reported in benchmarks like cactusBSSN and improving performance in applications such as Darktable.
Intel’s kernel test robot role: Intel’s automated testing resources continue to be pivotal in catching both regressions and enhancements in kernel updates, contributing to the reliability and performance of Linux on Intel hardware.
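To make the idea behind the one-line fix concrete: the patch restricts 2 MiB (PMD-boundary) alignment to anonymous mappings whose length is itself PMD-sized, so smaller mappings keep their natural placement instead of fragmenting the address space. The Python sketch below models that condition only; the function and constant names are illustrative, not the kernel’s actual identifiers:

```python
PMD_SIZE = 2 * 1024 * 1024  # 2 MiB huge-page (PMD) granularity on x86-64

def thp_align(addr, length):
    """Toy model of the patched condition: align an anonymous mapping to a
    PMD boundary only when its length is a multiple of PMD_SIZE."""
    if length % PMD_SIZE == 0:
        # Round the start address up to the next 2 MiB boundary.
        return (addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1)
    return addr  # small mappings keep their natural placement

print(hex(thp_align(0x7f0000001000, 4 * 1024 * 1024)))  # PMD-sized: aligned up
print(hex(thp_align(0x7f0000001000, 4096)))             # small: left as-is
```

Before the fix, aligning mappings that were merely larger than 2 MiB (but not multiples of it) left unusable gaps between them, which is where the reported regressions came from.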
How Google Ads Was Able to Support 4.77 Billion Users With a SQL Database (🔗 Read the Story)
Google’s journey from a university project to a global tech giant involved overcoming major data challenges. Initially storing ad data in MySQL, Google scaled quickly but ran into issues with MySQL partitioning, which couldn’t keep up with its explosive growth.
To address their needs for both NoSQL-level scalability and ACID compliance, Google developed Spanner, a distributed SQL database with groundbreaking architecture. Here’s a look at the key innovations that make Spanner a powerful and resilient database solution for massive-scale data:
Key Points
Atomicity with two-phase commit: To ensure transactions are all-or-nothing, Spanner uses a two-phase commit (2PC) protocol, where a coordinator checks if all partitions are ready to commit, ensuring atomic updates across the database.
Global consistency with TrueTime and Paxos: By synchronizing timestamps through GPS and atomic clocks (TrueTime) and coordinating writes with the Paxos algorithm, Spanner achieves strong consistency, so data appears the same regardless of geographical location.
Isolation and durability: Spanner uses two-phase locking for data isolation and synchronous writes via Paxos for durability, separating compute and storage to maintain data even in case of failure, ensuring 99.999% availability for Google’s critical ad services.
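The two-phase commit described in the first point can be sketched in a few lines. This is a toy model of the protocol’s control flow, not Spanner’s implementation: `Partition`, `prepare`, and `commit` are illustrative names, and a real coordinator would also send abort messages, log decisions, and handle timeouts:

```python
class Partition:
    """Toy participant in a two-phase commit."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.committed = name, healthy, False

    def prepare(self):   # phase 1: vote yes only if able to commit
        return self.healthy

    def commit(self):    # phase 2: apply the transaction
        self.committed = True

def two_phase_commit(partitions):
    # Phase 1: the coordinator asks every partition to prepare.
    if all(p.prepare() for p in partitions):
        # Phase 2: unanimous yes, so the commit is applied everywhere.
        for p in partitions:
            p.commit()
        return True
    return False  # any "no" vote aborts the whole transaction

shards = [Partition("us"), Partition("eu"), Partition("asia")]
print(two_phase_commit(shards))  # all healthy: transaction commits
```

The all-or-nothing property falls out of phase 1: no partition applies anything until every partition has voted yes, which is exactly the atomicity guarantee the bullet describes.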
🎬 And that's a wrap! Stay tuned for the top tech developments of next week.